

This notebook contains an example for teaching.


# A Simple Case Study using Wage Data from 2015 - proceeding

So far we have considered several machine learning methods, e.g., Lasso and Random Forests, to build a predictive model. In this lab, we extend our toolbox by predicting wages with a neural network.

## Data preparation

Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015.

In [359]:
using RData, LinearAlgebra, GLM, DataFrames, Statistics, Random, Distributions, DataStructures, NamedArrays, PrettyTables
import CodecBzip2

In [360]:
# Importing .Rdata file
rdata_read = load("../data/wage2015_subsample_inference.RData")

Dict{String, Any} with 1 entry:
  "data" => 5150×20 DataFrame…

In [361]:
# Since rdata_read is a dictionary, we check that it has a key called "data", which holds the data frame we need for our analysis
haskey(rdata_read, "data")

true

In [362]:
# Now we save that dataframe with a new name
data = rdata_read["data"]
names(data)

20-element Vector{String}:
 "wage"
 "lwage"
 "sex"
 "shs"
 "hsg"
 "scl"
 "clg"
 "ad"
 "mw"
 "so"
 "we"
 "ne"
 "exp1"
 "exp2"
 "exp3"
 "exp4"
 "occ"
 "occ2"
 "ind"
 "ind2"

In [363]:
typeof(data), size(data)

(DataFrame, (5150, 20))

In [364]:
Z = select(data, ["lwage", "wage"])     # outcome variables: log wage and wage

Row,lwage,wage
,Float64,Float64
1,2.26336,9.61538
2,3.8728,48.0769
3,2.40313,11.0577
4,2.63493,13.9423
5,3.36198,28.8462
6,2.46222,11.7308
7,2.95651,19.2308
8,2.95651,19.2308
9,2.48491,12.0
10,2.95651,19.2308


First, we split the data into a training and a test sample and normalize both.

In [379]:
Random.seed!(1234) 
training = sample(1:nrow(data), Int(floor(nrow(data)*(3/4))), replace = false)

3862-element Vector{Int64}:
 2455
 3109
 4807
 3107
  617
 2230
 4697
  755
  771
 3729
 2063
  528
 3238
    ⋮
 4057
  973
 4183
 1230
 3228
 3687
 4789
 4060
 2723
 3610
  899
 1273

In [380]:
data_train = data[training,1:16]
data_test = data[Not(training),1:16]
data_train

Row,wage,lwage,sex,shs,hsg,scl,clg,ad,mw
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,12.0192,2.48651,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,10.2857,2.33076,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,10.5769,2.35867,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,17.3077,2.85115,1.0,0.0,0.0,1.0,0.0,0.0,0.0
5,25.641,3.24419,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,42.3077,3.74497,0.0,0.0,0.0,0.0,1.0,0.0,1.0
7,15.8654,2.76414,0.0,1.0,0.0,0.0,0.0,0.0,0.0
8,72.1154,4.27827,1.0,0.0,0.0,0.0,1.0,0.0,0.0
9,35.7143,3.57555,1.0,0.0,0.0,0.0,0.0,1.0,0.0
10,15.3846,2.73337,1.0,0.0,1.0,0.0,0.0,0.0,0.0


In [381]:
size(data_train), size(data_test)

((3862, 16), (1288, 16))

In [382]:
# normalize the data: compute column means and standard deviations on the training sample

mean_1 = mean.(eachcol(data_train))
mean_1 = [names(data_train) mean_1]

16×2 Matrix{Any}:
 "wage"   23.3849
 "lwage"   2.97164
 "sex"     0.44407
 "shs"     0.0222683
 "hsg"     0.24754
 "scl"     0.283014
 "clg"     0.310979
 "ad"      0.136199
 "mw"      0.262558
 "so"      0.296737
 "we"      0.215691
 "ne"      0.225013
 "exp1"   13.6699
 "exp2"    2.97307
 "exp3"    8.05316
 "exp4"   24.4216

In [383]:
std_1 = std.(eachcol(data_train))
std_1 = [names(data_train) std_1]

16×2 Matrix{Any}:
 "wage"   20.445
 "lwage"   0.566839
 "sex"     0.496926
 "shs"     0.147574
 "hsg"     0.431639
 "scl"     0.450522
 "clg"     0.462954
 "ad"      0.343044
 "mw"      0.440081
 "so"      0.456879
 "we"      0.411354
 "ne"      0.417645
 "exp1"   10.5105
 "exp2"    3.94797
 "exp3"   14.2448
 "exp4"   52.4952

In [388]:
df = DataFrame()
for i in 1:ncol(data_train)
     # standardize column i with the training mean and standard deviation
     p = (data_train[!, i] .- mean_1[i,2]) / std_1[i,2]
     colname = names(data_train)[i]
     df[!,colname] = p
end
data_train = df
data_train

Row,wage,lwage,sex,shs,hsg,scl,clg,ad
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,-0.555914,-0.855849,-0.893634,-0.150896,-0.573489,-0.628192,1.48831,-0.39703
2,-0.640703,-1.13062,-0.893634,-0.150896,1.74326,-0.628192,-0.671727,-0.39703
3,-0.62646,-1.08137,-0.893634,-0.150896,-0.573489,1.59146,-0.671727,-0.39703
4,-0.297246,-0.212557,1.11874,-0.150896,-0.573489,1.59146,-0.671727,-0.39703
5,0.110351,0.480836,-0.893634,-0.150896,-0.573489,1.59146,-0.671727,-0.39703
6,0.925545,1.36429,-0.893634,-0.150896,-0.573489,-0.628192,1.48831,-0.39703
7,-0.367792,-0.36606,-0.893634,6.62538,-0.573489,-0.628192,-0.671727,-0.39703
8,2.38349,2.30512,1.11874,-0.150896,-0.573489,-0.628192,1.48831,-0.39703
9,0.603051,1.06541,1.11874,-0.150896,-0.573489,-0.628192,-0.671727,2.51805
10,-0.391307,-0.420346,1.11874,-0.150896,1.74326,-0.628192,-0.671727,-0.39703


In [389]:
df = DataFrame()
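for i in 1:ncol(data_test)
     # standardize the test column with the training mean and standard deviation,
     # not with the test sample's own statistics
     p = (data_test[!, i] .- mean_1[i,2]) / std_1[i,2]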
     colname = names(data_test)[i]
     df[!,colname] = p
end
data_test = df
data_test

Row,wage,lwage,sex,shs,hsg,scl,clg
,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,-0.67349,-1.24951,1.11874,-0.150896,-0.573489,-0.628192,1.48831
2,-0.461853,-0.594011,1.11874,-0.150896,-0.573489,-0.628192,-0.671727
3,-0.555914,-0.855849,1.11874,-0.150896,-0.573489,1.59146,-0.671727
4,-0.498805,-0.692294,-0.893634,-0.150896,1.74326,-0.628192,-0.671727
5,-0.273731,-0.164221,1.11874,-0.150896,-0.573489,-0.628192,-0.671727
6,-0.346322,-0.317912,1.11874,-0.150896,-0.573489,1.59146,-0.671727
7,-0.556854,-0.858674,-0.893634,-0.150896,-0.573489,1.59146,-0.671727
8,-0.181686,0.0131863,-0.893634,-0.150896,-0.573489,-0.628192,1.48831
9,-0.558527,-0.863707,1.11874,-0.150896,1.74326,-0.628192,-0.671727
10,0.0319668,0.366979,1.11874,-0.150896,-0.573489,-0.628192,1.48831


In [392]:
typeof(data_train), typeof(data_test)

(DataFrame, DataFrame)
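
The normalization above rescales the test sample with the means and standard deviations of the training sample. A minimal sketch of that idea on a toy vector follows; the numbers and the helper name `scale_with` are hypothetical and are not taken from the CPS data.

```julia
using Statistics

# Toy data (hypothetical values, for illustration only)
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
x_test  = [2.0, 6.0]

# Scaling parameters are estimated on the training sample only
mu    = mean(x_train)
sigma = std(x_train)          # sample standard deviation (n - 1 denominator)

# Hypothetical helper: apply a given centering and scaling to any vector
scale_with(x, mu, sigma) = (x .- mu) ./ sigma

z_train = scale_with(x_train, mu, sigma)
z_test  = scale_with(x_test, mu, sigma)   # reuses the training mu and sigma

println(mean(z_train))   # numerically zero
println(std(z_train))    # numerically one
println(z_test)          # generally neither mean zero nor unit variance
```

Only the training sample is guaranteed to have mean zero and unit variance after this step; the test sample is merely put on the same scale, so no information from it enters the preprocessing.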

Then, we construct the inputs for our network.

In [401]:
formula_basic = @formula(lwage ~ sex + exp1 + shs + hsg + scl + clg + mw + so + we)
println(formula_basic)

lwage ~ sex + exp1 + shs + hsg + scl + clg + mw + so + we


In [403]:
model_X_basic_train = ModelMatrix(ModelFrame(formula_basic,data_train)).m
model_X_basic_test = ModelMatrix(ModelFrame(formula_basic,data_test)).m

1288×10 Matrix{Float64}:
 1.0   1.11874   -0.634592   -0.150896  …  -0.596613  -0.649488  -0.524344
 1.0   1.11874    1.07799    -0.150896     -0.596613  -0.649488  -0.524344
 1.0   1.11874   -0.777306   -0.150896     -0.596613  -0.649488  -0.524344
 1.0  -0.893634  -0.539448   -0.150896     -0.596613  -0.649488  -0.524344
 1.0   1.11874   -0.254019   -0.150896     -0.596613  -0.649488  -0.524344
 1.0   1.11874    1.03041    -0.150896  …  -0.596613  -0.649488  -0.524344
 1.0  -0.893634  -0.87245    -0.150896     -0.596613  -0.649488  -0.524344
 1.0  -0.893634   1.17313    -0.150896     -0.596613  -0.649488  -0.524344
 1.0   1.11874    0.0314105  -0.150896     -0.596613  -0.649488  -0.524344
 1.0   1.11874    1.64884    -0.150896     -0.596613  -0.649488  -0.524344
 1.0   1.11874   -0.254019   -0.150896  …  -0.596613  -0.649488  -0.524344
 1.0  -0.893634   1.26827    -0.150896     -0.596613  -0.649488  -0.524344
 1.0   1.11874    0.364412   -0.150896     -0.596613  -0.649488  -0.524344


In [407]:
Y_train = data_train[!,"lwage"]
Y_test = data_test[!,"lwage"]

1288-element Vector{Float64}:
 -1.2495116701816313
 -0.5940111369079969
 -0.8558489593188235
 -0.6922940066202594
 -0.16422085652454593
 -0.3179120121732921
 -0.8586738886527183
  0.013186299642166738
 -0.8637071816306149
  0.36697911041067455
  1.3178594848348544
 -1.1634376242266362
 -1.7570311742901377
  ⋮
 -1.2495116701816313
  0.43617103718140365
 -0.07134846692710259
 -0.3847866913343217
 -1.2495116701816313
 -0.8558489593188235
  0.5351222789388115
  0.25782050367552267
 -0.6062199606652643
 -0.4923631674195659
  1.1961444692773646
  0.9241967739720466

In [408]:
model_X_basic_train

3862×10 Matrix{Float64}:
 1.0  -0.893634   1.07799    -0.150896  …   1.67569   -0.649488  -0.524344
 1.0  -0.893634  -0.444305   -0.150896     -0.596613   1.53928   -0.524344
 1.0  -0.893634   1.12556    -0.150896     -0.596613  -0.649488   1.90665
 1.0   1.11874   -0.920021   -0.150896     -0.596613   1.53928   -0.524344
 1.0  -0.893634   2.17213    -0.150896     -0.596613  -0.649488  -0.524344
 1.0  -0.893634  -0.158876   -0.150896  …   1.67569   -0.649488  -0.524344
 1.0  -0.893634   0.411983    6.62538      -0.596613  -0.649488   1.90665
 1.0   1.11874   -0.444305   -0.150896     -0.596613  -0.649488  -0.524344
 1.0   1.11874   -1.20545    -0.150896     -0.596613  -0.649488  -0.524344
 1.0   1.11874   -0.349162   -0.150896     -0.596613   1.53928   -0.524344
 1.0  -0.893634   0.935271   -0.150896  …   1.67569   -0.649488  -0.524344
 1.0   1.11874    0.507126   -0.150896     -0.596613  -0.649488  -0.524344
 1.0  -0.893634   0.697413   -0.150896     -0.596613   1.53928   -0.524344
 ⋮

### Neural Networks

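First, we need to determine the structure of our network. We use the R package *keras* to build a simple sequential neural network with three dense layers: two hidden layers with 20 and 10 units and ReLU activations, followed by a single linear output unit. Note that the remaining cells are R code, so the objects model_X_basic_train, model_X_basic_test, Y_train, and Y_test prepared above need to be available in the R session.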

In [13]:
dim(model_X_basic_train)[2]

In [14]:
library(keras)

build_model <- function() {
  model <- keras_model_sequential() %>% 
    layer_dense(units = 20, activation = "relu", 
                input_shape = dim(model_X_basic_train)[2])%>% 
    layer_dense(units = 10, activation = "relu") %>% 
    layer_dense(units = 1) 
  
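  model %>% compile(
    # recent keras versions name this argument learning_rate (formerly lr)
    optimizer = optimizer_adam(learning_rate = 0.005),
    loss = "mse", 
    metrics = c("mae")
  )
  model  # return the compiled model explicitly
}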

"package 'keras' was built under R version 4.0.5"
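
To make the architecture concrete, the following is a rough sketch in plain Julia of the forward pass that a 20-10-1 dense network computes. It is not the keras implementation: the type `DenseLayer`, the helpers `init_layer`, `affine`, and `nn_forward`, the random initialization, and the five simulated observations are all assumptions made for illustration.

```julia
using Random

relu(z) = max.(z, 0.0)

struct DenseLayer
    W::Matrix{Float64}   # weights, one column per unit
    b::Vector{Float64}   # biases, one per unit
end

# Hypothetical initialization: small Gaussian weights and zero biases
init_layer(rng, n_in, n_out) = DenseLayer(0.1 .* randn(rng, n_in, n_out), zeros(n_out))

# Affine map for a batch X with one observation per row
affine(layer, X) = X * layer.W .+ layer.b'

function nn_forward(layers, X)
    h = X
    for layer in layers[1:end-1]
        h = relu(affine(layer, h))   # hidden layers: affine map followed by ReLU
    end
    return affine(layers[end], h)    # output layer: affine map only
end

rng = MersenneTwister(1234)
p = 10                               # number of input columns, as in the model matrix
layers = [init_layer(rng, p, 20), init_layer(rng, 20, 10), init_layer(rng, 10, 1)]

X = randn(rng, 5, p)                 # five simulated observations
yhat = nn_forward(layers, X)
println(size(yhat))                  # (5, 1)
```

Training then amounts to adjusting every `W` and `b` so that the predictions move closer to the observed outcomes, which is what the optimizer in the compiled model does.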


Let us have a look at the structure of our network in detail.

In [15]:
model <- build_model()
summary(model)  # prints the layers and the number of trainable parameters

Loaded Tensorflow version 2.7.0



ERROR: Error: 


It is worth noting that the network has $441$ trainable parameters in total.
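
The parameter count can be checked by hand: each dense layer has one weight per input-unit pair plus one bias per unit. A short Julia sketch of this arithmetic, assuming the 10 input columns of the model matrix shown earlier (the helper name `dense_params` is made up for this example):

```julia
# Parameters of a dense layer: one weight per (input, unit) pair plus one bias per unit
dense_params(n_in, n_out) = n_in * n_out + n_out

widths = [10, 20, 10, 1]   # input columns, first hidden layer, second hidden layer, output
per_layer = [dense_params(widths[i], widths[i + 1]) for i in 1:(length(widths) - 1)]

println(per_layer)        # [220, 210, 11]
println(sum(per_layer))   # 441
```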

Now, let us train the network. Note that this takes some computation time, so we use a GPU to speed it up. The exact speed-up depends on a number of factors, including model architecture, batch size, and input pipeline complexity.

In [None]:
# training the network 
num_epochs <- 1000
model %>% fit(model_X_basic_train, Y_train,
              epochs = num_epochs, batch_size = 100, verbose = 0)
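
To get a feeling for the computational cost, we can count the gradient updates implied by these settings. The Julia sketch below assumes the 3862 training rows from above and assumes that the final, smaller batch of each epoch is kept rather than dropped.

```julia
n_train    = 3862   # rows of the training model matrix
batch_size = 100
num_epochs = 1000

batches_per_epoch = cld(n_train, batch_size)   # ceiling division: the last batch is smaller
last_batch_size   = n_train - (batches_per_epoch - 1) * batch_size
total_updates     = batches_per_epoch * num_epochs

println(batches_per_epoch)   # 39
println(last_batch_size)     # 62
println(total_updates)       # 39000
```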

After training the neural network, we can evaluate the performance of our model on the test sample.

In [None]:
# evaluating the performance
model %>% evaluate(model_X_basic_test, Y_test, verbose = 0)

In [None]:
# Calculating the performance measures
pred.nn <- model %>% predict(model_X_basic_test)
# regressing the squared errors on a constant yields the MSE and its standard error
MSE.nn = summary(lm((Y_test-pred.nn)^2~1))$coef[1:2]
R2.nn <- 1-MSE.nn[1]/var(Y_test)
# printing R^2
cat("R^2 of the neural network:",R2.nn)

In [None]:
# test MSE (first entry) and its standard error (second entry)
MSE.nn = summary(lm((Y_test-pred.nn)^2~1))$coef[1:2]
MSE.nn
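
The intercept-only regression used above is a compact way to obtain the mean of the squared errors together with its standard error. The same quantities can be computed directly, as in this Julia sketch; the outcome and prediction vectors are hypothetical and only illustrate the formulas.

```julia
using Statistics

# Hypothetical outcomes and predictions, for illustration only
y    = [0.5, -1.0, 0.2, 1.3, -0.4]
yhat = [0.3, -0.6, 0.0, 1.0, -0.1]

sq_err = (y .- yhat) .^ 2

mse    = mean(sq_err)                          # intercept of the constant-only regression
se_mse = std(sq_err) / sqrt(length(sq_err))    # its standard error
r2     = 1 - mse / var(y)                      # out-of-sample R^2, as in the R cell above

println("MSE: ", mse)
println("Standard error of the MSE: ", se_mse)
println("R^2: ", r2)
```

Both `std` and `var` in Julia use the $n - 1$ denominator, which matches the conventions of `lm` and `var` in R.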