## 1.0 The Bootstrap

In this lab, we analyze the Pennsylvania re-employment bonus experiment, which was previously studied in "Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment" (Bilias, 2000), among others. These experiments were conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI). UI claimants were randomly assigned either to a control group or to one of six treatment groups. Here we focus on treatment group 4, but feel free to explore the other treatment groups. In the control group, the standard UI rules applied. Individuals in the treatment groups were offered a cash bonus if they found a job within a pre-specified period of time (the qualification period), provided that the job was retained for a specified duration. The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus declined over the qualification period; see http://qed.econ.queensu.ca/jae/2000-v15.6/bilias/readme.b.txt for further details on the data.

In [25]:
using LinearAlgebra, GLM, DataFrames, Statistics, Random, Distributions, 
DataStructures, NamedArrays, PrettyTables, StatsModels, Combinatorics, CSV, RData 

import CodecBzip2
using FilePaths

In [26]:
using GLM, StatsModels
using DataTables
using DelimitedFiles, DataFrames, Lasso
using FilePaths
using StatsModels, Combinatorics
using CategoricalArrays
using StatsBase, Statistics
using TypedTables
using MacroTools
using NamedArrays
using PrettyTables # Dataframe or Datatable to latex
using TexTables # pretty regression table and tex outcome

In [37]:
# Load the data

mat, head = readdlm("../../../data/penn_jae.dat", header=true, Float64)
df = DataFrame(mat, vec(head))
describe(df)

# Dimensions of the data frame

a = size(df, 1)
b = size(df, 2)

# Keep only the control group (tg == 0) and treatment group 4

penn = filter(row -> row[:tg] in [4,0], df)

# Recode treatment group 4 as a 0/1 indicator and rename the column
replace!(penn.tg, 4 => 1)

rename!(penn, "tg" => "T4")


# Convert dep from float to string
penn[!,:dep] = string.(penn[!,:dep])

# Store dep as a categorical variable
penn[!,:dep] = categorical(penn[!,:dep])

describe(penn)

# Number of observations in the estimation sample
n = size(penn)[1]

5099

In [43]:
function boot_fn(data, index)
    # Fit the regression on the rows selected by `index` (one bootstrap resample)
    ols_1 = lm(@formula(log(inuidur1) ~ T4 + female+black+othrace+dep+q2+q3+q4+q5+q6+agelt35+agegt54+durable+lusd+husd), data[index,:])
    # First four coefficients, in order: intercept, T4, female, black
    return coef(ols_1)[1:4]
end

boot_fn (generic function with 1 method)

In [46]:
function boot_2(data, func, R)
    n_obs = nrow(data)
    # One row per replication; columns: intercept, T4, female, black
    draws = zeros(R, 4)

    for i in 1:R
        # Draw a single resample per replication and keep all coefficients from that fit
        index = sample(1:n_obs, n_obs, replace = true)
        draws[i, :] = func(data, index)
    end

    # Bootstrap mean and standard error of the T4, female and black coefficients
    T = DataFrame(
        "Variable" => ["T4", "Female", "Black"],
        "Coefficient (bootstrap)" => vec(mean(draws[:, 2:4], dims = 1)),
        "Standard error (bootstrap)" => vec(std(draws[:, 2:4], dims = 1, corrected = true)),
    )

    bootstrap_statistics = Dict{String,Any}("Table" => T)
    return bootstrap_statistics
end

boot_2 (generic function with 1 method)

In [47]:
boot_2(penn,boot_fn,1000)["Table"]

| Row | Variable | Coefficient (bootstrap) | Standard error (bootstrap) |
|---|---|---|---|
| | String | Float64 | Float64 |
| 1 | T4 | -0.0732922 | 0.0355541 |
| 2 | Female | 0.124125 | 0.034041 |
| 3 | Black | -0.292912 | 0.0576301 |
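
To make the resampling idea behind `boot_fn` and `boot_2` concrete, here is a minimal, self-contained sketch of a nonparametric (pairs) bootstrap for OLS standard errors. It uses only Julia's standard library and simulated data. The helper name `bootstrap_se`, the seeds, the sample size and the number of replications are assumptions made for this illustration; it does not reproduce the notebook's estimates.

```julia
using Random, Statistics, LinearAlgebra

# Hypothetical helper: bootstrap standard errors of OLS coefficients
function bootstrap_se(X, y; R = 500, rng = MersenneTwister(1234))
    n = size(X, 1)
    draws = zeros(R, size(X, 2))
    for r in 1:R
        idx = rand(rng, 1:n, n)            # resample row indices with replacement
        draws[r, :] = X[idx, :] \ y[idx]   # OLS fit on the resampled rows
    end
    return vec(std(draws, dims = 1))       # spread of the estimates across resamples
end

# Simulated data (assumption): intercept plus one regressor
rng = MersenneTwister(42)
n = 400
X = hcat(ones(n), randn(rng, n))
y = X * [1.0, -0.5] + randn(rng, n)

se_boot = bootstrap_se(X, y)

# Classical homoskedastic OLS standard errors, for comparison
beta = X \ y
resid = y - X * beta
sigma2 = sum(abs2, resid) / (n - size(X, 2))
se_ols = sqrt.(diag(sigma2 * inv(X' * X)))
```

With homoskedastic simulated errors, `se_boot` and `se_ols` should be close; the bootstrap becomes more useful when an analytic standard error is hard to derive.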


## 2.0 Comparative model

In [48]:
cps2012 = load("../../../data/cps2012.RData")
data = cps2012["data"]

| Row | year | lnw | female | widowed | divorced | separated | nevermarried | hsd08 |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 2012.0 | 1.90954 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2012.0 | 1.36577 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2012.0 | 2.54022 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2012.0 | 1.80109 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 2012.0 | 3.3499 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 2012.0 | 2.00283 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 2012.0 | 2.45609 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 8 | 2012.0 | 3.57305 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 2012.0 | 2.51366 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 2012.0 | 0.289633 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [49]:
Random.seed!(1234)

TaskLocalRNG()

In [50]:
training = sample( collect(1:nrow( data ) ), trunc(Int, 3 * nrow( data ) / 4 ),  replace= false )

data_train = data[ vec(training), : ]
data_test = data[ Not(training), : ]

| Row | year | lnw | female | widowed | divorced | separated | nevermarried | hsd08 |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 2012.0 | 1.90954 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2012.0 | 2.00283 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2012.0 | 0.289633 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2012.0 | 3.14226 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 2012.0 | 1.75389 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 2012.0 | 2.05892 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 2012.0 | 2.42792 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 2012.0 | 3.37343 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 2012.0 | 2.52323 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 2012.0 | 1.8693 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
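
The 75/25 split above can also be sketched with the standard library alone. The helper name `train_test_indices`, the seed and the toy sample size are assumptions for this illustration; the notebook itself draws the training rows with `sample` and selects the remainder with `Not`.

```julia
using Random

# Hypothetical helper: shuffle 1:n and cut it into training and test indices
function train_test_indices(n; train_frac = 0.75, rng = MersenneTwister(1234))
    perm = randperm(rng, n)                  # random permutation of 1:n
    n_train = trunc(Int, train_frac * n)     # same rounding rule as in the cell above
    return perm[1:n_train], perm[(n_train + 1):end]
end

train_idx, test_idx = train_test_indices(100)
```

Every row lands in exactly one of the two sets, so the test sample is never used for fitting.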


We construct the two different model matrices $X_{basic}$ and $X_{flex}$ for both the training and the test sample:

In [55]:
formula_basic = @formula(lnw ~ female + female&(widowed + divorced + separated + nevermarried +
hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) )

FormulaTerm
Response:
  lnw(unknown)
Predictors:
  female(unknown)
  female(unknown) & widowed(unknown)
  female(unknown) & divorced(unknown)
  female(unknown) & separated(unknown)
  female(unknown) & nevermarried(unknown)
  female(unknown) & hsd08(unknown)
  female(unknown) & hsd911(unknown)
  female(unknown) & hsg(unknown)
  female(unknown) & cg(unknown)
  female(unknown) & ad(unknown)
  female(unknown) & mw(unknown)
  female(unknown) & so(unknown)
  female(unknown) & we(unknown)
  female(unknown) & exp1(unknown)
  female(unknown) & exp2(unknown)
  female(unknown) & exp3(unknown)

In [58]:
# All combinations of the terms in x, of size 1 up to n
combinations_upto(x, n) = Iterators.flatten(combinations(x, i) for i in 1:n)

# Expand (a + b + ...)^deg into main effects and interactions of distinct terms (no squared terms)
expand_exp(args, deg::ConstantTerm) =
    tuple(((&)(terms...) for terms in combinations_upto(args, deg.n))...)

# Apply this expansion whenever a formula contains `^`
StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.Schema, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)

In [60]:
formula_flex = @formula(lnw ~ female + female&(widowed + divorced + separated + nevermarried +
hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) + (widowed +
divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so +
we + exp1 + exp2 + exp3)^2)

formula_flex = apply_schema(formula_flex, schema(formula_flex, data))

FormulaTerm
Response:
  lnw(continuous)
Predictors:
  female(continuous)
  widowed(continuous)
  divorced(continuous)
  separated(continuous)
  nevermarried(continuous)
  hsd08(continuous)
  hsd911(continuous)
  hsg(continuous)
  cg(continuous)
  ad(continuous)
  mw(continuous)
  so(continuous)
  we(continuous)
  exp1(continuous)
  exp2(continuous)
  exp3(continuous)
  widowed(continuous) & divorced(continuous)
  widowed(continuous) & separated(continuous)
  widowed(continuous) & nevermarried(continuous)
  widowed(continuous) & hsd08(continuous)
  widowed(continuous) & hsd911(continuous)
  widowed(continuous) & hsg(continuous)
  widowed(continuous) & cg(continuous)
  widowed(continuous) & ad(continuous)
  widowed(continuous) & mw(continuous)
  widowed(continuous) & so(continuous)
  widowed(continuous) & we(continuous)
  widowed(continuous) & exp1(continuous)
  widowed(continuous) & exp2(continuous)
  widowed(continuous) & exp3(continuous)
  divorced(continuous) & separated(continuous)


In [61]:
model_X_basic_train = ModelMatrix(ModelFrame(formula_basic,data_train)).m
model_X_basic_test = ModelMatrix(ModelFrame(formula_basic,data_test)).m
p_basic = size(model_X_basic_test)[2]

17

In [73]:
model_X_flex_train = ModelMatrix(ModelFrame(formula_flex,data_train)).m
model_X_flex_test = ModelMatrix(ModelFrame(formula_flex,data_test)).m
p_flex = size(model_X_flex_test)[2]

137

In [63]:
Y_train = data_train[!, ["lnw"]] # Dataframe format
Y_test = data_test[ !,  ["lnw"]]

| Row | lnw |
|---|---|
| | Float64 |
| 1 | 1.90954 |
| 2 | 2.00283 |
| 3 | 0.289633 |
| 4 | 3.14226 |
| 5 | 1.75389 |
| 6 | 2.05892 |
| 7 | 2.42792 |
| 8 | 3.37343 |
| 9 | 2.52323 |
| 10 | 1.8693 |


In [64]:
p_basic
p_flex

137
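
The two column counts reported above (`p_basic = 17`, `p_flex = 137`) can be checked by hand from the formulas. The sketch below only counts terms; the variable names `controls`, `p_basic_check` and `p_flex_check` are introduced here for illustration.

```julia
# The 15 controls that enter both formulas
controls = [:widowed, :divorced, :separated, :nevermarried, :hsd08, :hsd911,
            :hsg, :cg, :ad, :mw, :so, :we, :exp1, :exp2, :exp3]
k = length(controls)

# Pairwise interactions of distinct controls, as produced by the `^2` expansion
pairs = [(controls[i], controls[j]) for i in 1:k for j in (i + 1):k]

# Basic model: intercept + female + k interactions of female with each control
p_basic_check = 1 + 1 + k

# Flexible model: basic columns + k main effects + all pairwise interactions
p_flex_check = p_basic_check + k + length(pairs)
```

Here `length(pairs)` is 105, so the counts are 17 and 137, matching the model matrices.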

As the outputs above show, the basic model matrix has $17$ columns and the flexible one has $137$ columns (both counts include the intercept). Let us fit our models to the training sample using the two different model specifications. We start by running a simple OLS regression.

### OLS

We fit the basic model to our training data by running an OLS regression and compute the mean squared error on the test sample.

In [66]:
fit_lm_basic = lm(formula_basic, data_train)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

lnw ~ 1 + female + female & widowed + female & divorced + female & separated + female & nevermarried + female & hsd08 + female & hsd911 + female & hsg + female & cg + female & ad + female & mw + female & so + female & we + female & exp1 + female & exp2 + female & exp3

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────────
                             Coef.  Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
───────────────────────────────────────────────────────────────────────────────────────────
(Intercept)             2.90357     0.00555009  523.16    <1e-99   2.89269       2.91445
female                 -0.653932    0.0462527   -14.14    <1e-44  -0.74459      -0.563273
female & widowed       -0.122183    0.0559409    -2.18    0.0290  -0.231831     -0.0125352
female

In [82]:
# Compute the Out-Of-Sample Performance
yhat_lm_basic = GLM.predict( fit_lm_basic , data_test )
res_lm_basic = ( Y_test[!,1] - yhat_lm_basic ).^ 2

matrix_ones = ones( size(res_lm_basic)[1] ,1 )
mean_residuals = lm(  matrix_ones, res_lm_basic )   # regress squared errors on a constant: first argument (X), second argument (Y)
MSE_lm_basic = [ coef( mean_residuals ) , stderror( mean_residuals ) ]
MSE_lm_basic  
R2_lm_basic = 1 .- ( MSE_lm_basic[1] / var( Y_test[!,1] ) )

1-element Vector{Float64}:
 0.1186694346581677
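
Regressing the squared prediction errors on a column of ones, as done above, is a compact way to obtain the test MSE (the coefficient) and its standard error. The same two numbers can be computed directly, which the sketch below shows on a toy example. The helper names `mse_with_se` and `oos_r2` and the toy vectors are assumptions for this illustration.

```julia
using Statistics

# MSE and its standard error: mean of squared errors and std / sqrt(n)
function mse_with_se(y, yhat)
    sq_err = (y .- yhat) .^ 2
    return mean(sq_err), std(sq_err) / sqrt(length(sq_err))
end

# Out-of-sample R^2 as used in this notebook: 1 - MSE / var(y)
oos_r2(y, yhat) = 1 - mean((y .- yhat) .^ 2) / var(y)

# Toy example (not the CPS data)
y_toy    = [1.0, 2.0, 3.0, 4.0]
yhat_toy = [1.5, 2.0, 2.5, 4.0]

mse_toy, se_toy = mse_with_se(y_toy, yhat_toy)
r2_toy = oos_r2(y_toy, yhat_toy)
```

In the toy example the MSE is 0.125 and the $R^2$ is 0.925. Note that `var` divides by $n - 1$ while the MSE divides by $n$, so this $R^2$ differs slightly from the textbook ratio of sums of squares in small samples.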

In [83]:
# ols (flexible model)
fit_lm_flex = lm( formula_flex, data_train ) 
yhat_lm_flex = GLM.predict( fit_lm_flex, data_test)

res_lm_flex = ( Y_test[!,1] - yhat_lm_flex ) .^ 2
mean_residuals = lm(  matrix_ones, res_lm_flex )
MSE_lm_flex = [ coef( mean_residuals ) , stderror( mean_residuals ) ]

R2_lm_flex = 1 .- ( MSE_lm_flex[1] / var( Y_test[!,1] ) )

1-element Vector{Float64}:
 0.2487068543236849

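We observe that, on this split, OLS achieves the lower test MSE and the higher out-of-sample $R^2$ with the flexible model (about $0.25$ versus $0.12$ for the basic model), so the larger $p/n$ ratio does not hurt here. We proceed by running lasso regressions and their variants.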

### Lasso, Ridge and Elastic Net


Considering the basic model, we first run lasso and post-lasso regressions and then compute the out-of-sample performance measures. Note that by using the *hdm* package and its function *rlasso*, we rely on a theoretically grounded choice of the penalty level $\lambda$ in the lasso regression.
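
For reference, the R *hdm* package documents its default data-driven penalty (with penalty loadings $\hat\Psi$ and no dependence of $\lambda$ on the design) roughly as follows. Whether the local `hdmjl` port uses exactly these defaults is an assumption, although the value `1.1` passed to `rlasso_arg` in the cells below is consistent with the constant $c$.

$$
\hat\beta \in \arg\min_{b} \; \frac{1}{n}\sum_{i=1}^{n} (y_i - x_i'b)^2 + \frac{\lambda}{n}\,\lVert \hat\Psi b \rVert_1,
\qquad
\lambda = 2c\,\sqrt{n}\;\Phi^{-1}\!\left(1 - \frac{\gamma}{2p}\right),
\qquad c = 1.1,\quad \gamma = \frac{0.1}{\log n}.
$$

Here $p$ is the number of regressors and $\Phi^{-1}$ is the standard normal quantile function, so the penalty grows slowly with $p$ and does not require cross-validation.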

In [86]:
# load HDM package

include("hdmjl/hdmjl.jl")

In [87]:
names_col1 = Symbol.(coefnames(fit_lm_basic))
X1 = DataFrame(model_X_basic_train, names_col1 )

| Row | (Intercept) | female | female & widowed | female & divorced | female & separated | female & nevermarried |
|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 9 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [88]:
# basic model
# lasso without post-lasso refitting (HDM)
# in rlasso_arg, the first `false` turns post-lasso off and the second `false` turns the intercept off


rlasso_basic  = rlasso_arg( X1, Y_train, nothing, false, false, true, false, false, 
                    nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )

fit_rlasso_basic = rlasso(rlasso_basic)


# post - lasso (HDM)
rlasso_basic_post  = rlasso_arg( X1, Y_train, nothing, true, false, true, false, false, 
                    nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )

fit_rlasso_basic_post = rlasso(rlasso_basic_post)

yhat_rlasso = model_X_basic_test*fit_rlasso_basic["coefficients"] 
yhat_rlasso_post = model_X_basic_test*fit_rlasso_basic_post["coefficients"] 


res_rlasso_basic = ( Y_test[!,1] - yhat_rlasso ).^ 2
matrix_ones = ones( size(res_rlasso_basic)[1] ,1 )
mean_residuals = lm(  matrix_ones, res_rlasso_basic )  
MSE_rlasso_basic = [ coef( mean_residuals ) , stderror( mean_residuals ) ]
R2_rlasso_basic = 1 .- ( MSE_rlasso_basic[1] / var(Y_test[!,1]) ) 

res_rlasso_basic_post = ( Y_test[!,1] - yhat_rlasso_post ).^ 2
matrix_ones = ones( size(res_rlasso_basic_post)[1] ,1 )
mean_residuals = lm(  matrix_ones, res_rlasso_basic_post )  
MSE_rlasso_basic_post = [ coef( mean_residuals ) , stderror( mean_residuals ) ]
R2_rlasso_basic_post = 1 .- ( MSE_rlasso_basic_post[1] / var(Y_test[!,1]) ) 


1-element Vector{Float64}:
 0.11206746334526163

## Flexible Model 

In [89]:
names_col2 = Symbol.(coefnames(fit_lm_flex))
X2 = DataFrame(model_X_flex_train, names_col2 )

| Row | (Intercept) | female | widowed | divorced | separated | nevermarried | hsd08 | hsd911 |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [90]:
# Flex - model
# Not post - lasso (HDM)


rlasso_flex  = rlasso_arg( X2, Y_train, nothing, false, false, true, false, false, 
                    nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )

fit_rlasso_flex = rlasso(rlasso_flex)


# post - lasso (HDM)
rlasso_flex_post  = rlasso_arg( X2, Y_train, nothing, true, false, true, false, false, 
                    nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )

fit_rlasso_flex_post = rlasso(rlasso_flex_post)

yhat_rlasso_flex = model_X_flex_test*fit_rlasso_flex["coefficients"] 
yhat_rlasso_flex_post = model_X_flex_test*fit_rlasso_flex_post["coefficients"] 


res_rlasso_flex = ( Y_test[!,1] - yhat_rlasso_flex ).^ 2
matrix_ones = ones( size(res_rlasso_flex)[1] ,1 )
mean_residuals = lm(  matrix_ones, res_rlasso_flex )  
MSE_rlasso_flex = [ coef( mean_residuals ) , stderror( mean_residuals ) ]
R2_rlasso_flex = 1 .- ( MSE_rlasso_flex[1] / var(Y_test[!,1]) ) 

res_rlasso_flex_post = ( Y_test[!,1] - yhat_rlasso_flex_post ).^ 2
matrix_ones = ones( size(res_rlasso_flex_post)[1] ,1 )
mean_residuals = lm(  matrix_ones, res_rlasso_flex_post )  
MSE_rlasso_flex_post = [ coef( mean_residuals ) , stderror( mean_residuals ) ]
R2_rlasso_flex_post = 1 .- ( MSE_rlasso_flex_post[1] / var(Y_test[!,1]) ) 


1-element Vector{Float64}:
 0.2429304764005944

The cell above repeats the same procedure for the flexible model.

It is worth noting that lasso regression works better for the more complex model.

In contrast to a theoretically based choice of the tuning parameter $\lambda$ in the lasso regression, we can also use cross-validation to determine the penalty level by applying the package *GLMNet* and its function *glmnetcv*. In this context, we also run a ridge and an elastic net regression by adjusting the parameter *alpha*.
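
To show how *alpha* moves between ridge (`alpha = 0`) and lasso (`alpha = 1`), the sketch below implements plain coordinate descent for the elastic net objective $\frac{1}{2n}\lVert y - Xb\rVert_2^2 + \lambda\left[\alpha\lVert b\rVert_1 + \frac{1-\alpha}{2}\lVert b\rVert_2^2\right]$ on simulated data. It is an illustration only and not the glmnet implementation: there is no intercept, no standardization, a fixed number of sweeps and no cross-validation. The names `soft_threshold` and `elastic_net_cd` and the simulated data are assumptions.

```julia
using Random, LinearAlgebra

# Soft-thresholding operator: shrinks z toward zero by g
soft_threshold(z, g) = sign(z) * max(abs(z) - g, 0.0)

# Hypothetical helper: coordinate descent for the elastic net (no intercept)
function elastic_net_cd(X, y; lambda, alpha, iters = 200)
    n, p = size(X)
    beta = zeros(p)
    r = copy(y)                                  # current residual y - X * beta
    for _ in 1:iters
        for j in 1:p
            xj = view(X, :, j)
            scale = dot(xj, xj) / n
            rho = dot(xj, r) / n + scale * beta[j]   # correlation with the partial residual
            new_bj = soft_threshold(rho, lambda * alpha) / (scale + lambda * (1 - alpha))
            r .-= xj .* (new_bj - beta[j])           # keep the residual in sync
            beta[j] = new_bj
        end
    end
    return beta
end

# Simulated data (assumption): two coefficients are exactly zero
rng = MersenneTwister(2012)
n, p = 200, 5
X = randn(rng, n, p)
y = X * [2.0, -1.0, 0.0, 0.0, 0.5] + 0.5 .* randn(rng, n)

lambda_max = maximum(abs.(X' * y)) / n           # smallest lasso penalty that zeroes everything
beta_lasso = elastic_net_cd(X, y; lambda = 1.01 * lambda_max, alpha = 1.0)
beta_ridge = elastic_net_cd(X, y; lambda = 0.1, alpha = 0.0)
beta_ols   = elastic_net_cd(X, y; lambda = 0.0, alpha = 1.0)
```

With a penalty just above `lambda_max` the lasso sets every coefficient to zero, ridge shrinks the coefficients without zeroing them, and `lambda = 0` recovers least squares.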

In [91]:
#import Pkg; Pkg.add("GLMNet")
using GLMNet



In [93]:
fit_lasso_cv   = GLMNet.glmnetcv(model_X_basic_train, Y_train[!,1], alpha=1)
fit_ridge   = GLMNet.glmnetcv(model_X_basic_train, Y_train[!,1], alpha=0)
fit_elnet   = GLMNet.glmnetcv(model_X_basic_train, Y_train[!,1], alpha= 0.5)

yhat_lasso_cv    = GLMNet.predict(fit_lasso_cv,  model_X_basic_test)
yhat_ridge   = GLMNet.predict(fit_ridge,  model_X_basic_test)
yhat_elnet   = GLMNet.predict(fit_elnet,  model_X_basic_test)

res_lasso_cv = ( Y_test[!,1] - yhat_lasso_cv ) .^ 2
mean_residuals = lm(  matrix_ones, res_lasso_cv )
MSE_lasso_cv = [ coef( mean_residuals ) , stderror( mean_residuals ) ]

res_ridge = ( Y_test[!,1] - yhat_ridge ) .^ 2
mean_residuals = lm(  matrix_ones, res_ridge )
MSE_ridge = [ coef( mean_residuals ) , stderror( mean_residuals ) ]

res_elnet = ( Y_test[!,1] - yhat_elnet ) .^ 2
mean_residuals = lm(  matrix_ones, res_elnet )
MSE_elnet = [ coef( mean_residuals ) , stderror( mean_residuals ) ]

R2_lasso_cv = 1 .- ( MSE_lasso_cv[1] / var( Y_test[!,1] ) )
R2_ridge = 1 .- ( MSE_ridge[1] / var( Y_test[!,1] ) )
R2_elnet = 1 .- ( MSE_elnet[1] / var( Y_test[!,1] ) )


1-element Vector{Float64}:
 0.11790348189933608

Note that the following calculations for the flexible model need some computation time.

In [94]:
fit_lasso_cv_flex   = GLMNet.glmnetcv(model_X_flex_train, Y_train[!,1], alpha=1)
fit_ridge_flex   = GLMNet.glmnetcv(model_X_flex_train, Y_train[!,1], alpha=0)
fit_elnet_flex   = GLMNet.glmnetcv(model_X_flex_train, Y_train[!,1], alpha= 0.5)

yhat_lasso_cv_flex    = GLMNet.predict(fit_lasso_cv_flex,  model_X_flex_test)
yhat_ridge_flex   = GLMNet.predict(fit_ridge_flex,  model_X_flex_test)
yhat_elnet_flex   = GLMNet.predict(fit_elnet_flex,  model_X_flex_test)

res_lasso_cv_flex = ( Y_test[!,1] - yhat_lasso_cv_flex ) .^ 2
mean_residuals = lm(  matrix_ones, res_lasso_cv_flex )
MSE_lasso_cv_flex = [ coef( mean_residuals ) , stderror( mean_residuals ) ]

res_ridge_flex = ( Y_test[!,1] - yhat_ridge_flex ) .^ 2
mean_residuals = lm(  matrix_ones, res_ridge_flex )
MSE_ridge_flex = [ coef( mean_residuals ) , stderror( mean_residuals ) ]

res_elnet_flex = ( Y_test[!,1] - yhat_elnet_flex ) .^ 2
mean_residuals = lm(  matrix_ones, res_elnet_flex )
MSE_elnet_flex = [ coef( mean_residuals ) , stderror( mean_residuals ) ]

R2_lasso_cv_flex = ( 1 .- ( MSE_lasso_cv_flex[1] / var( Y_test[!,1] ) ) )[1]
R2_ridge_flex = ( 1 .- ( MSE_ridge_flex[1] / var( Y_test[!,1] ) ) )[1]
R2_elnet_flex = ( 1 .- ( MSE_elnet_flex[1] / var( Y_test[!,1] ) ) )[1]


0.25038274567453134

The performance of the lasso regression with a cross-validated penalty is quite similar to the performance of lasso with a theoretically based choice of the tuning parameter.

## Non-linear models

Besides linear regression models, we consider nonlinear regression models to build a predictive model. We apply regression trees, random forests, boosted trees and neural nets to estimate the regression function $g(X)$. First, we load the relevant libraries.

In [34]:
#import Pkg; Pkg.add( "ScikitLearn" )

#import Pkg; Pkg.add("DecisionTree")

Lathe has no model for random forest regression

In [95]:
using ScikitLearn, DecisionTree



## Using build_tree

#### `model = build_tree(labels, features, n_subfeatures, max_depth, min_samples_leaf, min_samples_split, min_purity_increase)`

In [96]:
model = build_tree(Y_train[!,1], model_X_basic_train)

Decision Tree
Leaves: 890
Depth:  22

In [149]:
#Prune tree

model = build_tree(Y_train[!,1], model_X_basic_train, 0, 11,1,2,0.006)

#n_subfeatures = 0
#max_depth = 11
# min_samples_leaf = 1
# min_samples_split = 2
# min_purity_increase = 0.006

Decision Tree
Leaves: 32
Depth:  11
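
To give some intuition for what a regression tree does at each node, and why a minimum purity increase prunes the tree, the sketch below searches one feature for the threshold that most reduces the sum of squared errors. It is not DecisionTree.jl's internal code, and the package may define purity on a different scale; the helper names `sse` and `best_split` and the toy data are assumptions.

```julia
using Statistics

# Sum of squared deviations from the mean (0 for an empty node)
sse(v) = isempty(v) ? 0.0 : sum(abs2, v .- mean(v))

# Hypothetical helper: best threshold on a single feature by SSE reduction
function best_split(x, y)
    best_gain, best_thr = 0.0, NaN
    parent = sse(y)
    for thr in sort(unique(x))[1:end-1]        # candidate thresholds between observed values
        left = y[x .<= thr]
        right = y[x .> thr]
        gain = parent - (sse(left) + sse(right))
        if gain > best_gain
            best_gain, best_thr = gain, thr
        end
    end
    return best_thr, best_gain
end

# Toy data: a 0/1 feature (like `female`) and a log-wage-like outcome
x_toy = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
y_toy = [2.5, 2.4, 2.6, 2.9, 3.0, 2.8]

thr, gain = best_split(x_toy, y_toy)
```

In the toy data the split at `thr = 0.0` lowers the SSE from 0.28 to 0.04. A tree builder that requires a minimum improvement per split would reject splits whose gain falls below that threshold, which is how the pruned tree above ends up with far fewer leaves.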

https://github.com/bensadeghi/DecisionTree.jl

https://docs.juliahub.com/DecisionTree/pEDeB/0.10.5/

In [150]:
y_hat_tree = apply_tree(model, model_X_basic_test)

7305-element Vector{Float64}:
 2.5029087408669666
 2.903619152514961
 2.5029087408669666
 2.5029087408669666
 2.903619152514961
 2.903619152514961
 2.903619152514961
 2.903619152514961
 2.897098822800502
 2.5029087408669666
 2.903619152514961
 2.903619152514961
 2.903619152514961
 ⋮
 2.903619152514961
 2.903619152514961
 2.5029087408669666
 2.63628306919746
 2.903619152514961
 2.5029087408669666
 2.890112781324435
 2.903619152514961
 3.0957540599253988
 2.903619152514961
 3.0957540599253988
 2.903619152514961

In [151]:
res_tree = ( Y_test[!,1] - y_hat_tree ) .^ 2
mean_residuals = lm(  matrix_ones, res_tree )
MSE_prune_tree = [ coef( mean_residuals ) , stderror( mean_residuals ) ]

# Out-of-sample R^2 of the pruned tree
R2_prune_tree = ( 1 .- ( MSE_prune_tree[1] / var( Y_test[!,1] ) ) )[1]

print("R^2 using Prune - tree:", R2_prune_tree)

R^2 using Prune - tree:0.04803104423131266

## Results

In [152]:
table = NamedArray(zeros(13, 4))

table[1,2:3] = [MSE_lm_basic[1][1], MSE_lm_basic[2][1]]
table[2,2:3] = [MSE_lm_flex[1][1], MSE_lm_flex[2][1]]
table[3,2:3] = [MSE_rlasso_basic[1][1], MSE_rlasso_basic[2][1]]
table[4,2:3] = [MSE_rlasso_basic_post[1][1], MSE_rlasso_basic_post[2][1]]
table[5,2:3] = [MSE_rlasso_flex[1][1], MSE_rlasso_flex[2][1]]
table[6,2:3] = [MSE_rlasso_flex_post[1][1], MSE_rlasso_flex_post[2][1]]
table[7,2:3] = [MSE_lasso_cv[1][1], MSE_lasso_cv[2][1]]
table[8,2:3] = [MSE_ridge[1][1], MSE_ridge[2][1]]
table[9,2:3] = [MSE_elnet[1][1], MSE_elnet[2][1]]
table[10,2:3] = [MSE_lasso_cv_flex[1][1], MSE_lasso_cv_flex[2][1]]
table[11,2:3] = [MSE_ridge_flex[1][1], MSE_ridge_flex[2][1]]
table[12,2:3] = [MSE_elnet_flex[1][1], MSE_elnet_flex[2][1]]
table[13,2:3] = [MSE_prune_tree[1][1], MSE_prune_tree[2][1]]

table[1,4] = R2_lm_basic[1]
table[2,4] = R2_lm_flex[1]
table[3,4] = R2_rlasso_basic[1]
table[4,4] = R2_rlasso_basic_post[1]
table[5,4] = R2_rlasso_flex[1]
table[6,4] = R2_rlasso_flex_post[1]
table[7,4] = R2_lasso_cv[1]
table[8,4] = R2_ridge[1]
table[9,4] = R2_elnet[1]
table[10,4] = R2_lasso_cv_flex[1]
table[11,4] = R2_ridge_flex[1]
table[12,4] = R2_elnet_flex[1]
table[13,4] = R2_prune_tree[1]

T = DataFrame(table, [ :"Model",:"MSE", :"S.E. for MSE", :"R-squared"]) 
T[!,:Model] = string.(T[!,:Model]) 

T[1,1] = "Least Squares (basic)"
T[2,1] = "Least Squares (flexible)"
T[3,1] = "Lasso"
T[4,1] = "Post-Lasso"
T[5,1] = "Lasso (flexible)"
T[6,1] = "Post-Lasso (flexible)"
T[7,1] = "Cross-Validated lasso"
T[8,1] = "Cross-Validated ridge"
T[9,1] = "Cross-Validated elnet"
T[10,1] = "Cross-Validated lasso (flexible)"
T[11,1] = "Cross-Validated ridge (flexible)"
T[12,1] = "Cross-Validated elnet (flexible)"
T[13,1] = "Pruned Tree"

header = (["Model", "MSE", "S.E. for MSE", "R-squared"])

pretty_table(T; backend = Val(:html), header = header, formatters=ft_round(4), alignment=:c)

| Model | MSE | S.E. for MSE | R-squared |
|---|---|---|---|
| Least Squares (basic) | 0.3952 | 0.0232 | 0.1187 |
| Least Squares (flexible) | 0.3369 | 0.0234 | 0.2487 |
| Lasso | 0.4028 | 0.0233 | 0.1016 |
| Post-Lasso | 0.3982 | 0.0231 | 0.1121 |
| Lasso (flexible) | 0.3465 | 0.0231 | 0.2272 |
| Post-Lasso (flexible) | 0.3395 | 0.0231 | 0.2429 |
| Cross-Validated lasso | 0.3955 | 0.0232 | 0.118 |
| Cross-Validated ridge | 0.3965 | 0.0232 | 0.1158 |
| Cross-Validated elnet | 0.3955 | 0.0232 | 0.1179 |
| Cross-Validated lasso (flexible) | 0.3362 | 0.0233 | 0.2503 |
