# High-Dimensional Metrics in Julia

## Introduction

## How to Get Started

In [44]:
# ]add HDMjl
using CSV, DataFrames
function r_data(n = 1)
    n_m = "r_" * string(n) * ".csv"
    dta = CSV.read(n_m, DataFrame)
    return dta
end

r_data (generic function with 2 methods)

## Prediction Using Approximate Sparsity

In [45]:
using Random, Distributions
# joinpath keeps the relative path portable across Windows and Unix
include(joinpath("..", "src", "HDMjl.jl"))
# pwd()



Main.HDMjl

In [64]:
## 32 A Joint Significance test for Lasso Regression
# Random.seed!(12345)
# n = 100
# #sample size
# p = 100
# # number of variables
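s = 3
# number of variables with non-zero coefficients
# X = rand(Normal(), (n, p))
# Y = X * beta + randn(n);
dta = r_data(1)
n, p = size(dta)
p = p - 1
# the first column of dta is the outcome, so p counts regressors only
X = dta[:, Not(1)]
Y = dta[:, 1]
# beta needs p, so it is built only after p is known
beta = vcat(fill(3, s), zeros(p - s));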


In [72]:
lasso_reg = HDMjl.rlasso(X, Y, post = false)
# use lasso, not-Post-lasso
# lassoreg = rlasso(X, Y, post=false)
sum_lasso = HDMjl.r_summary(lasso_reg, all = false)
# can also do print(lassoreg, all=false)


    Post-Lasso Estimation: false
    Total number of variables: 50
    Number of selected variables: 3
    ---

  Variable    Estimate
  Intercept   0.0307194
  V2          4.40362
  V3          4.33222
  V4          4.39125

    ----
    Multiple R-squared: 0.969344147763692
    Adjusted R-squared: 0.9683861523813074
    

In [85]:
yhat_lasso = HDMjl.r_predict(lasso_reg)
#in-sample prediction
Xnew = rand(Normal(), (n, p))
# new X
Ynew = Xnew * beta + randn(n)
#new Y

dta11 = r_data(1.1)
Xnew = Matrix(dta11[:, Not(1)])
Ynew = dta11[:, 1]
# HDMjl.r_predict()
yhat_lasso_new = HDMjl.r_predict(lasso_reg, xnew = Xnew)
#out-of-sample prediction
post_lasso_reg = HDMjl.rlasso(X, Y, post = true);
#now use post-lasso
HDMjl.r_summary(post_lasso_reg, all = false)
# lasso_reg


    Post-Lasso Estimation: true
    Total number of variables: 50
    Number of selected variables: 4
    ---

  Variable    Estimate
  Intercept   0.00223374
  V2          4.98173
  V3          5.01485
  V4          5.02564
  V22         -0.443961

    ----
    Multiple R-squared: 0.9871727371309015
    Adjusted R-squared: 0.9866326418522026
    

In [86]:
yhat_post_lasso = HDMjl.r_predict(post_lasso_reg)
#in-sample prediction
yhat_post_lasso_new = HDMjl.r_predict(post_lasso_reg, xnew = Xnew)
#out-of-sample prediction
MAE = hcat(abs.(Ynew - yhat_lasso_new), abs.(Ynew - yhat_post_lasso_new))
mean.(eachcol(MAE))
# names(MAE) = c("lasso MAE", "Post-lasso MAE")
# print(MAE, digits = 2)

2-element Vector{Float64}:
 1.4760345675461333
 1.0345207789641953

## Inference on Target Regression Coefficients

In [137]:
#41 Intuition for the Orthogonality Principle in Linear Models via Partialling Out
using DataFrames, Pipe
Random.seed!(1)
dta2 = r_data(2)
X = dta2[:, Not(1)]
y = dta2[:, 1]
d = dta2[:, 2]
n, p = size(X)
px = p - 2
# n = 5000
# p = 20
# X = rand(Normal(), (n, p))
# d = X[:, 1] #|> rename(_, :x1 => :d)
X1 = X[:, 2:p]
beta = ones(p)
# y = X * beta + randn(n);

In [130]:
using GLM

function intercept(mtrx)
    mtrx = Matrix(mtrx)
    return hcat(ones(size(mtrx, 1)), mtrx)
end

full_fit = GLM.lm(intercept(X), y)

est = round(coeftable(full_fit).cols[1][2], digits = 3)
s_td = round(coeftable(full_fit).cols[2][2], digits = 3)

print("Estimate: $est ($s_td)")


Estimate: 0.978 (0.014)

In [140]:

lm_y = lm(intercept(X1), y)
lm_d = lm(intercept(X1), d)
# lm_y
rY = GLM.residuals(lm_y)
rd = GLM.residuals(lm_d)

partial_fit_ls = lm(hcat(ones(n), rd), rY)

est = round(coeftable(partial_fit_ls).cols[1][2], digits = 3)
s_td = round(coeftable(partial_fit_ls).cols[2][2], digits = 3)

print("Estimate: $est ($s_td)")


Estimate: 0.978 (0.014)
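
The full regression and the regression on residuals return the same point estimate, which is the Frisch-Waugh-Lovell result behind partialling out. The sketch below checks this on simulated data using only the standard library. The sample size, the coefficient vector, and the helper `with_intercept` are assumptions chosen for illustration and do not reproduce the `r_2.csv` data.

```julia
using LinearAlgebra, Random

Random.seed!(1)
n, p = 500, 5
X = randn(n, p)
d = X[:, 1]          # target regressor
X1 = X[:, 2:end]     # remaining controls
y = X * ones(p) + randn(n)

# Hypothetical helper: prepend a column of ones.
with_intercept(A) = hcat(ones(size(A, 1)), A)

# Full regression: coefficient on d is the second entry (after the intercept).
est_full = (with_intercept(X) \ y)[2]

# Partialling out: residualize y and d on the controls, then regress residual on residual.
W = with_intercept(X1)
rY = y - W * (W \ y)
rd = d - W * (W \ d)
est_partial = dot(rd, rY) / dot(rd, rd)

println("full: ", round(est_full, digits = 6), "  partialled out: ", round(est_partial, digits = 6))
```

The point estimates agree up to floating-point error. The standard errors of the two regressions can differ slightly because the residual regression uses a different degrees-of-freedom correction, which is negligible at the sample size used in this notebook.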

In [10]:
rY = HDMjl.rlasso(X1, y)["residuals"]
rd = HDMjl.rlasso(X1, d)["residuals"]
# intercept(rd)
# rY
partial_fit_ls = GLM.lm(intercept(rd), rY[:, 1])


est = round(coeftable(partial_fit_ls).cols[1][2], digits = 3)
s_td = round(coeftable(partial_fit_ls).cols[2][2], digits = 3)

print("Estimate: $est ($s_td)")


Estimate: 0.998 (0.014)

## Instrumental Variable Estimation in a High-Dimensional Setting

In [141]:
Eff = HDMjl.rlassoEffect(X[:, Not(1)], y, X[:, 1], method = "partialling out")
HDMjl.r_summary(Eff);

Estimates and significance testing of the effect of target variables
 Row   Estimate.   Std. Error   t value   Pr(>|t|)

   1    0.972739    0.0136868   71.0715        0.0
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


In [144]:
Eff = HDMjl.rlassoEffect(X[:, Not(1)], y, X[:, 1], method = "double selection")
HDMjl.r_summary(Eff);

LoadError: DimensionMismatch("mismatch in dimension 1 (expected 5000 got 1)")

In [13]:
##42 Inference confidence Intervals and Significance Testing

n = 100
#sample size
p = 100
# number of variables
s = 3
# number of non-zero variables
X = rand(Normal(), (n, p))
# X = matrix(rnorm(n * p), ncol = p)
# colnames(X) = paste("X", 1:p, sep = "")
beta = vcat(fill(3, s), zeros(p - s))
y = 1 .+ X * beta + randn(n);

In [14]:
lassoeffect = HDMjl.rlassoEffects(X, y, index = [1, 2, 3, 50])
HDMjl.r_print(lassoeffect)

Coefficients:

     X1      X2     X3     X50

  2.914   2.916   2.83   0.119


In [15]:
HDMjl.r_summary(lassoeffect)

Estimates and significance testing of the effect of target variables
       Estimate.   Std. Error   t value       Pr(>|t|)

   X1    2.91449    0.0994941   29.2931   1.27055e-188
   X2    2.91624    0.0977129    29.845   1.01869e-195
   X3    2.82964    0.0950844   29.7592   1.31723e-194
  X50   0.119404    0.0930049   1.28385       0.199194
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


In [16]:
HDMjl.r_confint(lassoeffect)

              2.5%      97.5%

   X1      2.71948    3.10949
   X2      2.72473    3.10776
   X3      2.64328      3.016
  X50   -0.0628818   0.301691


In [17]:
HDMjl.r_confint(lassoeffect, 0.95)

              2.5%      97.5%

   X1      2.71948    3.10949
   X2      2.72473    3.10776
   X3      2.64328      3.016
  X50   -0.0628818   0.301691

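The t values, p-values, and pointwise confidence bounds printed above are consistent with a standard normal approximation applied to each estimate and its standard error. The sketch below recomputes them from the numbers in the summary table. It assumes a normal reference distribution, which matches the printed output but is not confirmed from the HDMjl source, and the variable names are illustrative.

```julia
using Distributions

# Estimates and standard errors copied from the summary table above.
labels = ["X1", "X2", "X3", "X50"]
est = [2.91449, 2.91624, 2.82964, 0.119404]
se  = [0.0994941, 0.0977129, 0.0950844, 0.0930049]

# t statistics and two-sided p-values under a standard normal reference (assumption).
tval = est ./ se
pval = 2 .* ccdf.(Normal(), abs.(tval))

# Pointwise confidence bounds: estimate ± z * standard error.
level = 0.95
z = quantile(Normal(), 1 - (1 - level) / 2)
lower = est .- z .* se
upper = est .+ z .* se

for i in eachindex(est)
    println(rpad(labels[i], 4),
            " t = ", round(tval[i], digits = 3),
            " p = ", round(pval[i], sigdigits = 3),
            " CI = [", round(lower[i], digits = 5), ", ", round(upper[i], digits = 5), "]")
end
```

Changing `level` to another value, for example `0.90`, gives the corresponding pointwise interval. Joint (simultaneous) intervals need a different critical value and are not covered by this sketch.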

#### plot_cof

In [18]:
# plot(lassoeffect, main = "Confidence Intervals")

In [19]:
using RData, CodecXz, StatsModels, DataFrames
url = "https://github.com/cran/hdm/raw/master/data/cps2012.rda";
cps2012 = load(download(url))["cps2012"][1:500, :];

x_formula = @formula(lnw ~ -1 + female * (widowed + divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3)
 + +((widowed + divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3)* (widowed + divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3))
)
x_dframe = ModelFrame( x_formula, cps2012)
x1 = ModelMatrix(x_dframe)
x = x1.m
y = cps2012[:,"lnw"];
# rlassoEffects(x,y)

In [23]:
@time effects_female = HDMjl.rlassoEffects(x, y, index = vcat(1, 17:31));

 88.659953 seconds (59.42 M allocations: 175.342 GiB, 12.96% gc time)


In [132]:
typeof(x1)

ModelMatrix{Matrix{Float64}}

In [21]:
# # 43

library(hdm)
data(cps2012)
X = model.matrix(~-1 + female + female:(widowed + divorced + separated + nevermarried +
hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) + +(widowed +
divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so +
we + exp1 + exp2 + exp3)^2, data = cps2012)
dim(X)
## [1] 29217   136
X = X[, which(apply(X, 2, var) != 0)]
# exclude all constant variables
dim(X)
## [1] 29217   116
indexgender = grep("female", colnames(X))
y = cps2012$lnw

In [22]:
Sys.sleep(10)
effectsfemale = rlassoEffects(x = X, y = y, index = indexgender)
summary(effectsfemale)

[1] "Estimates and significance testing of the effect of target variables"
                    Estimate. Std. Error t value Pr(>|t|)    
female              -0.154923   0.050162  -3.088 0.002012 ** 
female:widowed       0.136095   0.090663   1.501 0.133325    
female:divorced      0.136939   0.022182   6.174 6.68e-10 ***
female:separated     0.023303   0.053212   0.438 0.661441    
female:nevermarried  0.186853   0.019942   9.370  < 2e-16 ***
female:hsd08         0.027810   0.120914   0.230 0.818092    
female:hsd911       -0.119335   0.051880  -2.300 0.021435 *  
female:hsg          -0.012890   0.019223  -0.671 0.502518    
female:cg            0.010139   0.018327   0.553 0.580114    
female:ad           -0.030464   0.021806  -1.397 0.162405    
female:mw           -0.001063   0.019192  -0.055 0.955811    
female:so           -0.008183   0.019357  -0.423 0.672468    
female:we           -0.004226   0.021168  -0.200 0.841760    
female:exp1          0.004935   0.007804   0.632 0.527139

In [23]:
jointCI = confint(effectsfemale, level = 0.95, joint = TRUE)
jointCI

                          2.5 %      97.5 %
female              -0.29422452 -0.01562204
female:widowed      -0.13367117  0.40586213
female:divorced      0.07479695  0.19908182
female:separated    -0.11671664  0.16332216
female:nevermarried  0.12925782  0.24444915
female:hsd08        -0.37450228  0.43012291
female:hsd911       -0.26902488  0.0303548
female:hsg          -0.06513949  0.03935993
female:cg           -0.04168175  0.06195886
female:ad           -0.09583693  0.03490944


In [24]:
Sys.sleep(7)
effectsfemale = rlassoEffects(lnw ~ female + female:(widowed + divorced + separated +
nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 +
exp3) + (widowed + divorced + separated + nevermarried + hsd08 + hsd911 + hsg +
cg + ad + mw + so + we + exp1 + exp2 + exp3)^2, data = cps2012, I = ~female +
female:(widowed + divorced + separated + nevermarried + hsd08 + hsd911 + hsg +
cg + ad + mw + so + we + exp1 + exp2 + exp3))

In [25]:
## 44

data(GrowthData)
dim(GrowthData)
## [1] 90 63
y = GrowthData[, 1, drop = F]
d = GrowthData[, 3, drop = F]
X = as.matrix(GrowthData)[, -c(1, 2, 3)]
varnames = colnames(GrowthData)

In [26]:
xnames = varnames[-c(1, 2, 3)]
# names of X variables
dandxnames = varnames[-c(1, 2)]
# names of D and X variables
# create formulas by pasting names (this saves typing times)
fmla = as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
lseffect = lm(fmla, data = GrowthData)

In [27]:
dX = as.matrix(cbind(d, X))
lassoeffect = rlassoEffect(x = X, y = y, d = d, method = "partialling out")
summary(lassoeffect)

[1] "Estimates and significance testing of the effect of target variables"
     Estimate. Std. Error t value Pr(>|t|)    
[1,]  -0.04981    0.01394  -3.574 0.000351 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



In [28]:
dX = as.matrix(cbind(d, X))
doubleseleffect = rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doubleseleffect)

[1] "Estimates and significance testing of the effect of target variables"
         Estimate. Std. Error t value Pr(>|t|)   
gdpsh465  -0.05001    0.01579  -3.167  0.00154 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



In [29]:
library(xtable)
table = rbind(summary(lseffect)$coef["gdpsh465", 1:2], summary(lassoeffect)$coef[,
1:2], summary(doubleseleffect)$coef[, 1:2])
colnames(table) = c("Estimate", "Std Error")
#names(summary(fullfit)$coef)[1:2]
rownames(table) = c("full reg via ols", "partial reg via post-lasso ", "partial reg via double selection")
tab = xtable(table, digits = c(2, 2, 5))
tab

                                     Estimate  Std. Error
full reg via ols                 -0.009377989  0.02988773
partial reg via post-lasso       -0.049811465  0.01393636
partial reg via double selection -0.050005855  0.01579138


## Inference on Treatment Effects in a High-Dimensional Setting

In [30]:
##51
data(AJR)
y = AJR$GDP
d = AJR$Exprop
z = AJR$logMort
x = model.matrix(~-1 + (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2,
data = AJR)
dim(x)

In [31]:
AJRXselect = rlassoIV(GDP ~ Exprop + (Latitude + Latitude2 + Africa + Asia + Namer +
Samer)^2 | logMort + (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2,
data = AJR, select.X = TRUE, select.Z = FALSE)
summary(AJRXselect)

[1] "Estimation and significance testing of the effect of target variables in the IV regression model"
       coeff.    se. t-value p-value   
Exprop 0.8450 0.2699   3.131 0.00174 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1




In [32]:
confint(AJRXselect)

           2.5 %   97.5 %
Exprop 0.3159812 1.374072


In [33]:
fmlay = GDP ~ (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2
fmlad = Exprop ~ (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2
fmlaz = logMort ~ (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2
rY = lm(fmlay, data = AJR)$res
rD = lm(fmlad, data = AJR)$res
rZ = lm(fmlaz, data = AJR)$res
# ivfitlm = tsls(y=rY,d=rD, x=NULL, z=rZ, intercept=FALSE)
ivfitlm = tsls(rY ~ rD | rZ, intercept = FALSE)
print(cbind(ivfitlm$coef, ivfitlm$se), digits = 3)

   [,1] [,2]
rD 1.27 1.73


In [34]:
rY = rlasso(fmlay, data = AJR)$res
rD = rlasso(fmlad, data = AJR)$res
rZ = rlasso(fmlaz, data = AJR)$res
# ivfitlasso = tsls(y=rY,d=rD, x=NULL, z=rZ, intercept=FALSE)
ivfitlasso = tsls(rY ~ rD | rZ, intercept = FALSE)
print(cbind(ivfitlasso$coef, ivfitlasso$se), digits = 3)

    [,1] [,2]
rD 0.845 0.27
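
The `tsls(rY ~ rD | rZ, intercept = FALSE)` calls above fit a just-identified IV regression with one instrument and no intercept. A minimal Julia sketch of that estimator follows. Simulated vectors stand in for the residuals `rY`, `rD`, and `rZ`; the data-generating process and the homoskedastic standard-error formula are assumptions for illustration and are not taken from the hdm or HDMjl sources.

```julia
using LinearAlgebra, Random

Random.seed!(1)
n = 10_000
z = randn(n)                           # instrument (stands in for rZ)
u = randn(n)                           # unobserved confounder
d = z .+ u .+ randn(n)                 # endogenous regressor (stands in for rD)
y = 1.0 .* d .+ 2.0 .* u .+ randn(n)   # outcome (stands in for rY); true effect is 1.0

# Just-identified IV without intercept: beta = (z'y) / (z'd).
beta_iv = dot(z, y) / dot(z, d)

# OLS without intercept for comparison; biased here because d is correlated with u.
beta_ols = dot(d, y) / dot(d, d)

# Homoskedastic standard error of the IV estimate (assumed formula):
# sqrt(sigma^2 * z'z) / |z'd|, with sigma^2 from the IV residuals.
resid = y .- beta_iv .* d
sigma2 = dot(resid, resid) / (n - 1)
se_iv = sqrt(sigma2 * dot(z, z)) / abs(dot(z, d))

println("IV:  ", round(beta_iv, digits = 3), " (", round(se_iv, digits = 3), ")")
println("OLS: ", round(beta_ols, digits = 3))
```

In this design the IV estimate lands close to the true effect of 1.0, while OLS is pulled upward by the confounder.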


In [35]:
data(EminentDomain)
z = as.matrix(EminentDomain$logGDP$z)
x = as.matrix(EminentDomain$logGDP$x)
y = EminentDomain$logGDP$y
d = EminentDomain$logGDP$d
x = x[, apply(x, 2, mean, na.rm = TRUE) > 0.05]
#
z = z[, apply(z, 2, mean, na.rm = TRUE) > 0.05]
#

In [36]:
EDols = lm(y ~ cbind(d, x))
ED2sls = tsls(y = y, d = d, x = x, z = z[, 1:2], intercept = FALSE)

In [37]:
lassoIVZ = rlassoIV(x = x, d = d, y = y, z = z, select.X = FALSE, select.Z = TRUE)
# or lassoIVZ = rlassoIVselectZ(x=X, d=d, y=y, z=z)
summary(lassoIVZ)

[1] "Estimates and significance testing of the effect of target variables in the IV regression model"
   coeff.    se. t-value p-value
d1 0.4146 0.2902   1.428   0.153




In [38]:
confint(lassoIVZ)

        2.5 %    97.5 %
d1 -0.1542764 0.9834796


In [39]:
lassoIVXZ = rlassoIV(x = x, d = d, y = y, z = z, select.X = TRUE, select.Z = TRUE)
summary(lassoIVXZ)

Estimates and Significance Testing of the effect of target variables in the IV regression model 
     coeff.      se. t-value p-value
d1 -0.02383  0.12851  -0.185   0.853




In [40]:
confint(lassoIVXZ)

        2.5 %    97.5 %
d1 -0.2757029 0.2280335


In [41]:
library(xtable)
table = matrix(0, 4, 2)
table[1, ] = summary(EDols)$coef[2, 1:2]
table[2, ] = cbind(ED2sls$coef[1], ED2sls$se[1])
table[3, ] = summary(lassoIVZ)[, 1:2]

[1] "Estimates and significance testing of the effect of target variables in the IV regression model"
   coeff.    se. t-value p-value
d1 0.4146 0.2902   1.428   0.153




In [42]:
table[4, ] = summary(lassoIVXZ)[, 1:2]

Estimates and Significance Testing of the effect of target variables in the IV regression model 
     coeff.      se. t-value p-value
d1 -0.02383  0.12851  -0.185   0.853




In [43]:
colnames(table) = c("Estimate", "Std Error")
rownames(table) = c("ols regression", "IV estimation ", "selection on Z", "selection on X and Z")
tab = xtable(table, digits = c(2, 2, 7))
tab

                         Estimate   Std. Error
ols regression        0.007864732  0.009865927
IV estimation        -0.010733269  0.033766362
selection on Z        0.414601641  0.290249208
selection on X and Z -0.023834697  0.128506538


In [44]:
data(pension)
y = pension$tw
d = pension$p401
z = pension$e401
X = pension[, c("i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5", "fsize",
"hs", "smcol", "col", "marr", "twoearn", "db", "pira", "hown")]
# simple model
xvar = c("i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5", "fsize", "hs",
"smcol", "col", "marr", "twoearn", "db", "pira", "hown")
xpart = paste(xvar, collapse = "+")
form = as.formula(paste("tw ~ ", paste(c("p401", xvar), collapse = "+"), "|", paste(xvar,
collapse = "+")))
formZ = as.formula(paste("tw ~ ", paste(c("p401", xvar), collapse = "+"), "|", paste(c("e401",
xvar), collapse = "+")))

In [45]:
pensionate = rlassoATE(form, data = pension)
summary(pensionate)

Estimation and significance testing of the treatment effect 
Type: ATE 
Bootstrap: not applicable 
   coeff.   se. t-value  p-value    
TE  10180  1931   5.273 1.34e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1




In [46]:
pensionatet = rlassoATET(form, data = pension)
summary(pensionatet)

Estimation and significance testing of the treatment effect 
Type: ATET 
Bootstrap: not applicable 
   coeff.   se. t-value p-value    
TE  12628  2944   4.289 1.8e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1




### Error

In [53]:
# pensionlate = rlassoLATE(X, d, y, z)
# pensionlate = rlassoLATE(formZ, data=pension)
# summary(pensionlate)

In [48]:
# pensionlatet = rlassoLATET(X, d, y, z)

In [49]:
xvar2 = paste("(", xvar, ")^2", sep = "")
formExt = as.formula(paste("tw ~ ", paste(c("p401", xvar2), collapse = "+"), "|",
paste(xvar2, collapse = "+")))
formZExt = as.formula(paste("tw ~ ", paste(c("p401", xvar2), collapse = "+"), "|",
paste(c("e401", xvar2), collapse = "+")))

In [50]:
pensionate = rlassoATE(X, z, y)
pensionatet = rlassoATET(X, z, y)
# pensionlate = rlassoLATE(X, d, y, z)
# pensionlatet = rlassoLATET(X, d, y, z)

## The Lasso Methods for Discovery of Significant Causes amongst Many Potential Causes, with Many Controls

In [54]:
set.seed(1)
n = 100
p1 = 20
p2 = 20
D = matrix(rnorm(n * p1), n, p1)
# Causes
W = matrix(rnorm(n * p2), n, p2)
X = cbind(D, W)
# Regressors
Y = D[, 1] * 5 + W[, 1] * 5 + rnorm(n)
#Outcome
confint(rlassoEffects(X, Y, index = c(1:p1)), joint = TRUE)

         2.5 %      97.5 %
V1   4.5145877  5.21430498
V2  -0.3142909  0.3049465
V3  -0.3524109  0.1867888
V4  -0.254243   0.28738914
V5  -0.2765802  0.27627177
V6  -0.3214676  0.29422684
V7  -0.2262507  0.30094168
V8  -0.0473541  0.47366372
V9  -0.1865636  0.3902352
V10 -0.2372356  0.26411185
