

This notebook contains an example for teaching.


# A Simple Case Study using Wage Data from 2015 - proceeding

So far we have considered several machine learning methods, e.g., Lasso and Random Forests, to build a predictive model. In this lab, we extend our toolbox by predicting wages with a neural network.

## Data preparation

Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015.

In [49]:
# Sys.setenv(RETICULATE_PYTHON = "C:/Users/MSI-NB/anaconda3/envs/tensorflow_2")

In [2]:
load("wage2015_subsample_inference.Rdata")
Z <- subset(data, select = -c(lwage, wage)) # regressors

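First, we split the data into a training sample and a test sample, and then normalize both using the training-sample means and standard deviations.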

In [3]:
nrow(data)

In [4]:
set.seed(1234)
training <- sample(nrow(data), nrow(data)*(3/4), replace=FALSE)
dim(data)

In [5]:
data_train <- data[training,1:16]
data_test <- data[-training,1:16]
data_train

,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5256,42.307692,3.744969,0,0,0,1,0,0,0,0,0,1,10.5,1.1025,1.157625,1.2155063
3452,19.230769,2.956512,0,0,0,0,1,0,0,0,0,1,14.0,1.9600,2.744000,3.8416000
15822,34.965035,3.554349,0,0,0,0,1,0,0,1,0,0,6.0,0.3600,0.216000,0.1296000
4887,4.808173,1.570317,0,0,0,0,1,0,0,0,0,1,11.0,1.2100,1.331000,1.4641000
29065,87.019231,4.466129,0,0,0,0,1,0,0,0,1,0,15.0,2.2500,3.375000,5.0625000
16889,37.692308,3.629456,0,0,0,0,1,0,0,1,0,0,7.0,0.4900,0.343000,0.2401000
12548,21.978022,3.090043,0,0,0,1,0,0,1,0,0,0,12.0,1.4400,1.728000,2.0736000
18765,24.725275,3.207826,0,0,1,0,0,0,0,1,0,0,17.0,2.8900,4.913000,8.3521000
15980,76.923077,4.342806,1,0,0,0,0,1,0,1,0,0,19.0,3.6100,6.859000,13.0321000
13860,8.333333,2.120264,1,0,0,0,1,0,1,0,0,0,10.0,1.0000,1.000000,1.0000000


In [6]:
# normalize the data
mean <- apply(data_train, 2, mean)
std <- apply(data_train, 2, sd)

In [7]:
data_train <- scale(data_train, center = mean, scale = std)
data_test <- scale(data_test, center = mean, scale = std)
data_test

,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4
30,-0.56512120,-0.88688853,1.1001515,-0.156195,-0.5774748,-0.6177817,1.474932,-0.394843,-0.587042,-0.6531165,-0.5223374,1.828481,-1.20471378,-0.754611597,-0.57197944,-0.47308794
77,-0.55112463,-0.84443200,-0.9087303,-0.156195,1.7312289,-0.6177817,-0.677822,-0.394843,-0.587042,-0.6531165,-0.5223374,1.828481,1.51284700,1.470848545,1.26986669,1.01984791
119,-0.71441799,-1.41856114,-0.9087303,-0.156195,-0.5774748,1.6182755,-0.677822,-0.394843,-0.587042,-0.6531165,-0.5223374,1.828481,-0.97044130,-0.726762402,-0.56912276,-0.47281320
129,-0.49447183,-0.68240305,-0.9087303,-0.156195,1.7312289,-0.6177817,-0.677822,-0.394843,-0.587042,-0.6531165,-0.5223374,1.828481,-0.54875083,-0.598656104,-0.53711950,-0.46554031
164,-0.31784839,-0.25637682,-0.9087303,-0.156195,-0.5774748,-0.6177817,1.474932,-0.394843,-0.587042,-0.6531165,-0.5223374,1.828481,-0.54875083,-0.598656104,-0.53711950,-0.46554031
261,-0.64443512,-1.14915079,-0.9087303,-0.156195,1.7312289,-0.6177817,-0.677822,-0.394843,-0.587042,-0.6531165,-0.5223374,1.828481,1.23172002,1.047540776,0.77070790,0.50642664
280,-0.78440087,-1.73720596,1.1001515,-0.156195,-0.5774748,1.6182755,-0.677822,-0.394843,-0.587042,-0.6531165,-0.5223374,1.828481,0.34148459,0.001029903,-0.20643619,-0.30022407
368,-0.72159573,-1.44869478,-0.9087303,-0.156195,-0.5774748,1.6182755,-0.677822,-0.394843,-0.587042,-0.6531165,-0.5223374,1.828481,-0.87673231,-0.706958530,-0.56583120,-0.47233399
433,-0.16855160,0.03712069,1.1001515,-0.156195,-0.5774748,-0.6177817,-0.677822,2.531997,-0.587042,-0.6531165,-0.5223374,1.828481,0.48204808,0.136562653,-0.10413319,-0.23289086
445,-0.62110750,-1.06784739,1.1001515,-0.156195,1.7312289,-0.6177817,-0.677822,-0.394843,-0.587042,-0.6531165,-0.5223374,1.828481,0.48204808,0.136562653,-0.10413319,-0.23289086


In [8]:
data_train <- as.data.frame(data_train)
data_test <- as.data.frame(data_test)
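
The split-and-normalize step above can be sketched on toy data. This is a minimal illustration, not the notebook's data: the data frame `toy` and the names `mu` and `sigma` are hypothetical. The point it shows is that the centering and scaling constants are computed on the training sample only and then reused for the test sample.

```r
# Toy data standing in for the wage data (hypothetical values)
set.seed(1)
toy <- data.frame(y = rnorm(200, mean = 3), x = runif(200, 0, 40))

# 3/4 of the rows go to the training sample, the rest to the test sample
idx <- sample(nrow(toy), floor(nrow(toy) * 3 / 4), replace = FALSE)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

# Centering and scaling constants come from the training sample only
mu    <- apply(toy_train, 2, mean)
sigma <- apply(toy_train, 2, sd)

toy_train_s <- scale(toy_train, center = mu, scale = sigma)
toy_test_s  <- scale(toy_test,  center = mu, scale = sigma)

# Training columns have mean 0 and sd 1 by construction;
# test columns are only approximately standardized
colMeans(toy_train_s)
apply(toy_train_s, 2, sd)
colMeans(toy_test_s)
```

Reusing the training constants keeps information from the test sample out of the preprocessing.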

In [9]:
data_train

,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5256,0.91851566,1.35499018,-0.9087303,-0.156195,-0.5774748,1.6182755,-0.677822,-0.394843,-0.587042,-0.6531165,-0.5223374,1.8284805,-0.31447835,-0.4841650,-0.4930756,-0.45068639
3452,-0.20121027,-0.02300296,-0.9087303,-0.156195,-0.5774748,-0.6177817,1.474932,-0.394843,-0.587042,-0.6531165,-0.5223374,1.8284805,0.01350312,-0.2718922,-0.3848546,-0.40228399
15822,0.56223923,1.02184148,-0.9087303,-0.156195,-0.5774748,-0.6177817,1.474932,-0.394843,-0.587042,1.5307240,-0.5223374,-0.5467606,-0.73616882,-0.6679697,-0.5573123,-0.47070109
4887,-0.90101566,-2.44566576,-0.9087303,-0.156195,-0.5774748,-0.6177817,1.474932,-0.394843,-0.587042,-0.6531165,-0.5223374,1.8284805,-0.26762386,-0.4575535,-0.4812481,-0.44610448
29065,3.08798466,2.61536757,-0.9087303,-0.156195,-0.5774748,-0.6177817,1.474932,-0.394843,-0.587042,-0.6531165,1.9139756,-0.5467606,0.10721211,-0.2001032,-0.3418084,-0.37978118
16889,0.69457047,1.15310740,-0.9087303,-0.156195,-0.5774748,-0.6177817,1.474932,-0.394843,-0.587042,1.5307240,-0.5223374,-0.5467606,-0.64245983,-0.6357884,-0.5486485,-0.46866442
12548,-0.06790957,0.21037091,-0.9087303,-0.156195,-0.5774748,1.6182755,-0.677822,-0.394843,1.703015,-0.6531165,-0.5223374,-0.5467606,-0.17391486,-0.4006174,-0.4541651,-0.43487058
18765,0.06539114,0.41622125,-0.9087303,-0.156195,1.7312289,-0.6177817,-0.677822,-0.394843,-0.587042,1.5307240,-0.5223374,-0.5467606,0.29463010,-0.0416722,-0.2368875,-0.31914948
15980,2.59810456,2.39983461,1.1001515,-0.156195,-0.5774748,-0.6177817,-0.677822,2.531997,-0.587042,1.5307240,-0.5223374,-0.5467606,0.48204808,0.1365627,-0.1041332,-0.23289086
13860,-0.72996974,-1.48452021,1.1001515,-0.156195,-0.5774748,-0.6177817,1.474932,-0.394843,1.703015,-0.6531165,-0.5223374,-0.5467606,-0.36133285,-0.5095387,-0.5038286,-0.45465846


Then, we construct the inputs for our network.

In [10]:
X_basic <- "sex + exp1 + shs + hsg + scl + clg + mw + so + we"
formula_basic <- as.formula(paste("lwage", "~", X_basic))
formula_basic

lwage ~ sex + exp1 + shs + hsg + scl + clg + mw + so + we

In [11]:
model_X_basic_train <- model.matrix(formula_basic, data_train)
model_X_basic_test <- model.matrix(formula_basic, data_test)

Y_train <- data_train$lwage
Y_test <- data_test$lwage

In [12]:
model_X_basic_train

,(Intercept),sex,exp1,shs,hsg,scl,clg,mw,so,we
5256,1,-0.9087303,-0.31447835,-0.156195,-0.5774748,1.6182755,-0.677822,-0.587042,-0.6531165,-0.5223374
3452,1,-0.9087303,0.01350312,-0.156195,-0.5774748,-0.6177817,1.474932,-0.587042,-0.6531165,-0.5223374
15822,1,-0.9087303,-0.73616882,-0.156195,-0.5774748,-0.6177817,1.474932,-0.587042,1.5307240,-0.5223374
4887,1,-0.9087303,-0.26762386,-0.156195,-0.5774748,-0.6177817,1.474932,-0.587042,-0.6531165,-0.5223374
29065,1,-0.9087303,0.10721211,-0.156195,-0.5774748,-0.6177817,1.474932,-0.587042,-0.6531165,1.9139756
16889,1,-0.9087303,-0.64245983,-0.156195,-0.5774748,-0.6177817,1.474932,-0.587042,1.5307240,-0.5223374
12548,1,-0.9087303,-0.17391486,-0.156195,-0.5774748,1.6182755,-0.677822,1.703015,-0.6531165,-0.5223374
18765,1,-0.9087303,0.29463010,-0.156195,1.7312289,-0.6177817,-0.677822,-0.587042,1.5307240,-0.5223374
15980,1,1.1001515,0.48204808,-0.156195,-0.5774748,-0.6177817,-0.677822,-0.587042,1.5307240,-0.5223374
13860,1,1.1001515,-0.36133285,-0.156195,-0.5774748,-0.6177817,1.474932,1.703015,-0.6531165,-0.5223374
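
To see what `model.matrix` does with a formula of this kind, here is a small sketch on a hypothetical three-row data frame (`toy` and `f` are illustrative names, not objects from the notebook). It shows that the design matrix gains an `(Intercept)` column of ones in addition to the listed regressors.

```r
# Hypothetical toy data with the outcome and two regressors
toy <- data.frame(lwage = c(2.9, 3.1, 3.5),
                  sex   = c(0, 1, 0),
                  exp1  = c(10, 5, 20))

# Build the formula from a string, as in the cell above
f <- as.formula(paste("lwage", "~", "sex + exp1"))
X <- model.matrix(f, toy)

colnames(X)  # "(Intercept)" "sex" "exp1"
dim(X)       # 3 rows, 3 columns
```

In the notebook this constant column is passed to the network along with the nine regressors, which is why the input dimension is 10. Dense layers add their own bias terms by default, so the constant column is redundant but harmless.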


### Neural Networks

First, we need to determine the structure of our network. We are using the R package *keras* to build a simple sequential neural network with three dense layers.

In [13]:
dim(model_X_basic_train)[2]

In [14]:
library(keras)

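build_model <- function() {
  # two hidden ReLU layers (20 and 10 units) followed by a linear output unit
  model <- keras_model_sequential() %>%
    layer_dense(units = 20, activation = "relu",
                input_shape = dim(model_X_basic_train)[2]) %>%
    layer_dense(units = 10, activation = "relu") %>%
    layer_dense(units = 1)

  model %>% compile(
    # keras >= 2.6 names this argument `learning_rate`; older releases use `lr`
    optimizer = optimizer_adam(learning_rate = 0.005),
    loss = "mse",
    metrics = c("mae")
  )
}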

"package 'keras' was built under R version 4.0.5"
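
The computation that this architecture performs can be sketched in base R without keras. The weights below are random and untrained, and all names (`relu`, `W1`, `b1`, and so on) are hypothetical; the sketch only illustrates how an input row flows through two ReLU layers and a linear output unit.

```r
set.seed(1)
relu <- function(z) pmax(z, 0)  # element-wise ReLU, keeps matrix dimensions

n_inputs <- 10
X <- matrix(rnorm(5 * n_inputs), nrow = 5)  # 5 hypothetical observations

# Randomly initialized weights and zero biases (illustration only)
W1 <- matrix(rnorm(n_inputs * 20, sd = 0.1), n_inputs, 20); b1 <- rep(0, 20)
W2 <- matrix(rnorm(20 * 10, sd = 0.1), 20, 10);             b2 <- rep(0, 10)
W3 <- matrix(rnorm(10 * 1, sd = 0.1), 10, 1);               b3 <- 0

h1  <- relu(sweep(X %*% W1, 2, b1, "+"))   # first hidden layer, 20 units
h2  <- relu(sweep(h1 %*% W2, 2, b2, "+"))  # second hidden layer, 10 units
out <- h2 %*% W3 + b3                      # linear output, one prediction per row

dim(out)  # 5 1
```

Training with keras amounts to choosing the entries of these weight matrices and bias vectors to minimize the mean squared error.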


Let us have a look at the structure of our network in detail.

In [15]:
model <- build_model()
summary(model)

Loaded Tensorflow version 2.7.0



It is worth noting that the network has $441$ trainable parameters in total.
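
The parameter count can be checked by hand. Each dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 10 input columns (the nine regressors plus the intercept column of the design matrix); the variable names are illustrative.

```r
n_inputs <- 10            # columns of the design matrix, intercept included
units    <- c(20, 10, 1)  # units in the three dense layers

# number of inputs feeding each layer
fan_in <- c(n_inputs, head(units, -1))

# weights plus biases per layer
params <- fan_in * units + units
params       # 220 210 11
sum(params)  # 441
```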

Now, let us train the network. Training takes some computation time, so we use a GPU to speed it up. The exact speed-up depends on a number of factors, including the model architecture, the batch size, and the complexity of the input pipeline.

In [None]:
# training the network
num_epochs <- 1000
model %>% fit(model_X_basic_train, Y_train,
              epochs = num_epochs, batch_size = 100, verbose = 0)

After training the neural network, we can evaluate the performance of our model on the test sample.

In [None]:
# evaluating the performance: returns the test loss (MSE) and the MAE
model %>% evaluate(model_X_basic_test, Y_test, verbose = 0)

In [None]:
# Calculating the performance measures
pred.nn <- model %>% predict(model_X_basic_test)
# regressing the squared errors on a constant gives the MSE and its standard error
MSE.nn <- summary(lm((Y_test - pred.nn)^2 ~ 1))$coef[1:2]
R2.nn <- 1 - MSE.nn[1] / var(Y_test)
# printing R^2
cat("R^2 of the neural network:", R2.nn)

In [None]:
# test-sample MSE and its standard error (computed in the previous cell)
MSE.nn
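
The `lm(... ~ 1)` idiom used above may look unusual. Regressing the squared errors on a constant returns their mean as the coefficient estimate, together with the standard error of that mean. The sketch below checks this on simulated data; `y`, `pred`, and the other names are hypothetical and unrelated to the notebook's results.

```r
set.seed(1)
y    <- rnorm(500)                      # hypothetical test outcomes
pred <- 0.5 * y + rnorm(500, sd = 0.1)  # hypothetical predictions
sq_err <- (y - pred)^2

# estimate and standard error from the intercept-only regression
fit <- summary(lm(sq_err ~ 1))$coef[1:2]

# the same two quantities computed directly
mse <- mean(sq_err)
se  <- sd(sq_err) / sqrt(length(sq_err))

all.equal(fit, c(mse, se))  # TRUE

# out-of-sample R^2 as in the cell above
r2 <- 1 - mse / var(y)
r2
```

The standard error gives a rough sense of how precisely the test sample pins down the MSE.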