<font size=5>Bonus exercise if time allows</font>

<font size=4>The bonus exercise is largely identical to the previous one. The only difference is that, in addition to early stopping, we will save the best model during training to counter overfitting.<br>Early stopping means that the validation loss is monitored (recorded for each epoch) and training stops if the loss has not decreased during, for example, the last 20 epochs (n or patience=20; n is the number of epochs used as the [patience parameter](03_Exercise.ipynb#patience)). Without saving the best model (in our case the one with the lowest validation loss), the model we end up with has been trained for <b>n</b> more epochs after it reached its best performance, so it is very likely not as good as it was n epochs earlier.<br>To help solve this issue we are going to save the best model during training (still using early stopping).<br><br>We are going to run the first part of the script in one batch. It is the same as before, except that calls that only print information, such as the following, have been left out.</font>
```R
head(feat.train,2)
```
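
<font size=4>The effect described above can be reproduced without a network. The following sketch uses made-up validation losses and a small patience value (both are assumptions for illustration only) and mimics the rule "stop when the loss has not improved for <code>patience</code> epochs". It shows that training ends <code>patience</code> epochs after the best epoch.</font>

```R
# hypothetical validation losses, one value per epoch (not real training output)
val.loss <- c(0.90,0.70,0.55,0.48,0.45,0.47,0.46,0.49,0.50,0.52,0.51,0.53)
patience <- 3

best.loss <- Inf
best.epoch <- 0
stop.epoch <- length(val.loss)
for (epoch in seq_along(val.loss)) {
  if (val.loss[epoch] < best.loss) {
    # new best validation loss: remember it and the epoch it occurred in
    best.loss <- val.loss[epoch]
    best.epoch <- epoch
  } else if (epoch-best.epoch >= patience) {
    # no improvement for 'patience' epochs: stop training
    stop.epoch <- epoch
    break
  }
}

# best.epoch is 5 and stop.epoch is 8: the final model was trained 3 epochs past its best state
paste0("Best epoch: ",best.epoch,", training stopped after epoch: ",stop.epoch)
```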

In [None]:
# cell 1.
###########################################################
############ IMPORTING AND PRE-PROCESSING DATA ############
###########################################################

# loading sample plot information based on field observations
# each column represents some sample plot specific information and each row a sample plot
# we are interested in the columns "v", "h" and "d" (total growing stock, mean height and mean diameter)

# training data (the Neural Network (NN) will be trained with this dataset)
sp.data.train <- read.csv("../data/AV.leaf.on.train.csv",as.is=T)
# validation data (this dataset helps to avoid overfitting during training)
sp.data.val <- read.csv("../data/AV.leaf.on.val.csv",as.is=T)
# test data (this dataset is used to evaluate the trained NN on data that is unknown to the NN)
sp.data.test <- read.csv("../data/AV.leaf.on.test.csv",as.is=T)

# loading remote sensing features calculated for each sample plot from LiDAR (laser) data
# each column (apart from the sample plot ID) represents a feature and each row a sample plot

# importing LiDAR features for the entire data (training, validation and test)
feat <- readRDS("../data/las.feat.AV-MK.leaf.on.RDS")

# separating the features into training, validation and test sets based on the sample plot IDs of
# sample plot information imported above
feat.train <- feat[feat$sampleplotid%in%sp.data.train$sampleplotid,]
feat.val <- feat[feat$sampleplotid%in%sp.data.val$sampleplotid,]
feat.test <- feat[feat$sampleplotid%in%sp.data.test$sampleplotid,]

# pre-processing feature data

# at this point the column of sample plot ID can be removed
feat.train$sampleplotid <- NULL; feat.val$sampleplotid <- NULL; feat.test$sampleplotid <- NULL

# some of the features (columns) might have no variation (same value is repeated in every row)
# such information is not helpful and should be removed (standard deviation is 0 or NaN/NA)
orig.nfeat <- ncol(feat.train)
feat.train <- feat.train[,apply(feat.train,2,function(x) !(sd(x)==0|is.na(sd(x))))]
feat.val <- feat.val[,apply(feat.val,2,function(x) !(sd(x)==0|is.na(sd(x))))]
feat.test <- feat.test[,apply(feat.test,2,function(x) !(sd(x)==0|is.na(sd(x))))]

# keeping only those features that are present in all 3 datasets
# extracting column names common in all 3 sets
feat.common <- Reduce(intersect,list(names(feat.train),names(feat.val),names(feat.test)))
# subsetting datasets
feat.train <- feat.train[,feat.common]
feat.val <- feat.val[,feat.common]
feat.test <- feat.test[,feat.common]; rm(feat.common)

# scaling is done so that after scaling each column's mean equals 0 and standard deviation equals 1
# the attributes "center" and "scale" of the training set are going to be used to scale the validation and test sets
# "center" is the mean and "scale" is the standard deviation of each column in the training data
train.data <- scale(feat.train)
mean.train <- attr(train.data,"scaled:center")
sd.train <- attr(train.data,"scaled:scale")
val.data <- scale(feat.val,center=mean.train,scale=sd.train)
test.data <- scale(feat.test,center=mean.train,scale=sd.train)

# pre-processing sample plot data

# creating variable for forest attributes we are going to use
for.attrs <- c("v","h","d")

# selecting columns
sp.data.train <- sp.data.train[,for.attrs]
sp.data.val <- sp.data.val[,for.attrs]
sp.data.test <- sp.data.test[,for.attrs]

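# scaling data the same way as above; mean.train and sd.train are overwritten here and from now on hold the statistics of the forest attributes (needed for de-scaling in cell 4.)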
train.labels <- scale(sp.data.train)
mean.train <- attr(train.labels,"scaled:center")
sd.train <- attr(train.labels,"scaled:scale")
# Keras' fit function doesn't accept the output of scale() for labels, it needs to be converted to data frame
val.labels <- as.data.frame(scale(sp.data.val,center=mean.train,scale=sd.train))
test.labels <- as.data.frame(scale(sp.data.test,center=mean.train,scale=sd.train))
train.labels <- as.data.frame(train.labels)

###########################################################
############ CREATING AND TRAINING THE NETWORK ############
###########################################################

# loading keras library
library(keras)

# loading custom functions stored in a separate R script (ML_with_R/functions/keras_tf_funcs.R)
source("../functions/keras_tf_funcs.R")

# setting the number of neurons in the hidden layer
# uncomment the line(s) that you want to use (don't uncomment lines starting with # #)

# # Ns/(a*(Ni+No)) (Ns: number of samples; a: scaling factor (2-10);
# # Ni: number of input neurons (features); No: number of output neurons (dependent variables))
# a <- 2
# n.neur <- ceiling(nrow(train.data)/(a*(ncol(train.data)+ncol(train.labels))))
# paste0("Number of neurons: ",n.neur)

# # number of neurons of hidden layer can be Ni*2/3+No
# n.neur <- ceiling(ncol(train.data)*(2/3)+ncol(train.labels))
# paste0("Number of neurons: ",n.neur)

# # number of neurons of hidden layer can be (Ni+No)/2
# n.neur <- ceiling((ncol(train.data)+ncol(train.labels))/2)
# paste0("Number of neurons: ",n.neur)

# # set the number of neurons to any value you'd like to test
# # (too big values (over 200) will slow down the training process!)
n.neur <- 6

# designing the network

# creating input layer; the shape parameter has to be set
# to the number of columns (number of features) in the feature table (see cell 4.)
inputs <- layer_input(shape=ncol(train.data))

# creating network structure
# we are going to calculate predictions for multiple forest attributes (v, h, d)
for.attrs <- c("v","h","d")
# therefore we are iterating through these attribute names and adding them one-by-one to the network
preds <- lapply(for.attrs,function(for.attr) {
  inputs %>%
    # adding hidden layer
    # number of units: this parameter should be tested with different values
    # activation: different activation functions can be tested
    #             for in-built functions use function name in quotes (e.g. "relu", "elu")
    #             for custom functions use function name without quotes (e.g. swish_activation)
    # layer_dense(units=n.neur,activation="relu") %>%
    layer_dense(units=n.neur,activation=swish_activation) %>%
    # adding output layer (number of units 1 as output is one value for each attribute)
    layer_dense(units=1,name=for.attr)
})

# creating the model
tf.model <- keras_model(inputs=inputs,outputs=preds)

# finalizing and configuring the model
# some parameters for configuration:

# weights for forest attributes
# here we can set how much each forest attribute is affecting the overall accuracy score
# the order is the same as for.attrs <- c("v","h","d")
# can be experimented with (e.g. c(0.7,0.15,0.15))
preds.w <- c(0.6,0.2,0.2)

# the optimizer function to be used (see the Optimizers section of 02_Simple_Neural_Networks)
# (for further options go to https://keras.rstudio.com/reference/index.html#section-optimizers)
opt <- "adam"
# opt <- "rmsprop"

# loss function to be used (see the Loss section of 02_Simple_Neural_Networks)
# (for further options go to https://keras.rstudio.com/reference/index.html#section-losses)
# Mean Squared Error is considered to be a good choice for regression problems and shouldn't be changed
loss <- "mean_squared_error"

# compiling the model
tf.model %>% compile(
  optimizer=opt,
  loss=loss,
  loss_weights=preds.w
)

# parameters for fit function
batch_size=25
patience=20
epochs=200

# setting up early stopping against overfitting
early.stop <- callback_early_stopping(monitor="val_loss",patience=patience)

# printing parameters used during model training
paste0("Number of neurons: ",n.neur)
paste0("Weights for forest attributes (v, h, d): ",paste0(preds.w,collapse=", "))
paste0("Optimization function used: ",opt)
paste0("Batch size used: ",batch_size)
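
<font size=4>When <code>loss_weights</code> is given for a model with several outputs, Keras minimizes the weighted sum of the individual losses. The sketch below illustrates this with made-up MSE values for v, h and d (assumed numbers, not results of this model) and compares the two weight sets mentioned in cell 1.</font>

```R
# hypothetical per-attribute MSE values in scaled units (illustration only)
mse <- c(v=0.40,h=0.25,d=0.30)

# weights used in cell 1. and the alternative suggested in its comments
preds.w <- c(0.6,0.2,0.2)
preds.w.alt <- c(0.7,0.15,0.15)

# the overall loss is the weighted sum of the individual losses
total.loss <- sum(preds.w*mse)
total.loss.alt <- sum(preds.w.alt*mse)

paste0("Total loss with weights ",paste0(preds.w,collapse=", "),": ",total.loss)
paste0("Total loss with weights ",paste0(preds.w.alt,collapse=", "),": ",total.loss.alt)
```

<font size=4>Because v has the largest error in this made-up example, giving it more weight increases the overall loss, so the optimizer is pushed harder to improve v.</font>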

<font size=4>In the next cell we will create the function that saves the best model.</font>

In [None]:
# cell 2.
# first creating the output file name for the model
# out.name will make it possible to save best models for various parameter combinations
# set it to reflect the parameters that have been used to train the model (see cell 14. in 03_Exercise.ipynb)
out.name <- "adam_swish_bs25_6.2.2"
# out.name <- "adam_relu_bs25_6.2.2"
filepath <- paste0("best.model.",out.name,".hdf5")

# creating checkpoint callback
# the model is saved after an epoch only if its validation loss is lower than the best one so far (save_best_only=T)
# save_weights_only: if FALSE (F) the full model is saved, if TRUE only the weights
# mode: the target is to minimize validation loss
cp_callback <- callback_model_checkpoint(
  filepath=filepath,monitor="val_loss",
  save_weights_only=F,save_best_only=T,
  mode="min"
)

# fit the model (same as train the model)
# note that the model saving function is added to the parameter callbacks
history <- fit(object=tf.model,x=train.data,y=train.labels,batch_size=batch_size,
               epochs=epochs,validation_data=list(val.data,val.labels),
               verbose=2,callbacks=list(early.stop,cp_callback))
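
<font size=4>To see what <code>save_best_only</code> means in practice, the following sketch marks the epochs after which a checkpoint would be written. The validation losses are made up, and the logic is a simplified imitation of the callback, not its actual implementation.</font>

```R
# hypothetical validation losses, one value per epoch (illustration only)
val.loss <- c(0.90,0.70,0.75,0.55,0.60,0.50,0.52,0.51)

# best validation loss seen before each epoch (Inf before the first epoch)
best.before <- cummin(c(Inf,head(val.loss,-1)))

# a checkpoint is written only when the current loss beats the best one so far
saved.after <- which(val.loss < best.before)

# the file on disk always holds the model of the last saving epoch,
# which is the epoch with the lowest validation loss
saved.after
tail(saved.after,1) == which.min(val.loss)
```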

In [None]:
# cell 3.

# plotting the training history
# right click on the picture and select "Open image in new tab"
# history stores the planned number of epochs; when early stopping ends training sooner,
# it has to be set to the number of epochs actually run before plotting
history$params$epochs <- length(history$metrics$loss)
plot(history)
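
<font size=4>The epoch at which the best model was saved can also be read from the training history. The sketch below uses a hand-made list that only imitates the <code>metrics</code> part of the object returned by <code>fit</code> (an assumption for illustration; with a real run use <code>history</code> itself).</font>

```R
# hypothetical stand-in for the history object; the values are made up
history.sketch <- list(metrics=list(
  loss=c(0.95,0.72,0.60,0.52,0.47,0.44,0.42,0.40),
  val_loss=c(0.90,0.70,0.58,0.50,0.49,0.51,0.52,0.53)
))

# number of epochs that were actually run
epochs.run <- length(history.sketch$metrics$loss)
# epoch with the lowest validation loss (the one kept by the checkpoint callback)
best.epoch <- which.min(history.sketch$metrics$val_loss)

paste0("Epochs run: ",epochs.run,", best epoch: ",best.epoch,
       ", epochs trained after the best one: ",epochs.run-best.epoch)
```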

<font size=4>Now we are going to load the saved best model and calculate its performance for training, validation and test datasets.</font>

In [None]:
# cell 4.
# calculating error estimates, evaluating model performance

# file name of best model
best.model.name <- paste0("best.model.",out.name,".hdf5")

# loading the model
# if you have used a custom object (e.g. swish_activation) use the custom_objects parameter
# if you haven't used any custom objects the custom_objects parameter will be ignored
best.model <- load_model_hdf5(best.model.name,custom_objects=c("python_function"=swish_activation))

# calculating predictions with the trained model
train.predictions <- best.model %>% predict(train.data); train.predictions <- as.data.frame(Reduce(cbind,train.predictions))
val.predictions <- best.model %>% predict(val.data); val.predictions <- as.data.frame(Reduce(cbind,val.predictions))
test.predictions <- best.model %>% predict(test.data); test.predictions <- as.data.frame(Reduce(cbind,test.predictions))

# de-scaling predictions (when forest attribute data is scaled in the model, the predictions are
# going to be scaled as well and need to be de-scaled using the same method used for scaling the 
# original forest attributes)
train.predictions <- sapply(1:ncol(train.predictions),function(i) (train.predictions[i]*sd.train[i])+mean.train[i])
train.predictions <- as.data.frame(do.call(cbind,train.predictions))

val.predictions <- sapply(1:ncol(val.predictions),function(i) (val.predictions[i]*sd.train[i])+mean.train[i])
val.predictions <- as.data.frame(do.call(cbind,val.predictions))

test.predictions <- sapply(1:ncol(test.predictions),function(i) (test.predictions[i]*sd.train[i])+mean.train[i])
test.predictions <- as.data.frame(do.call(cbind,test.predictions))

# calculating relative RMSE and bias for all datasets using predefined functions (see cell 6.)
train.rmse <- rel.rmse(sp.data.train,train.predictions)
train.bias <- rel.bias(sp.data.train,train.predictions)

val.rmse <- rel.rmse(sp.data.val,val.predictions)
val.bias <- rel.bias(sp.data.val,val.predictions)

test.rmse <- rel.rmse(sp.data.test,test.predictions)
test.bias <- rel.bias(sp.data.test,test.predictions)

# printing results
results <- rbind(train.rmse,val.rmse,test.rmse)
row.names(results) <- c("training RMSE %","validation RMSE %","test RMSE %")
results
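
<font size=4>De-scaling reverses the scaling step: each column is multiplied by the training standard deviation and the training mean is added back. The sketch below checks this round trip on made-up forest attribute values. It uses <code>sweep</code> instead of the <code>sapply</code> construct of cell 4., which is only an alternative way of writing the same calculation.</font>

```R
# hypothetical toy data; column names mirror for.attrs, the values are made up
toy.train <- data.frame(v=c(100,150,200,250),h=c(10,12,15,19),d=c(12,15,18,23))
toy.val <- data.frame(v=c(120,220),h=c(11,16),d=c(13,20))

# scaling with the statistics of the training data
toy.train.scaled <- scale(toy.train)
mu <- attr(toy.train.scaled,"scaled:center")
sigma <- attr(toy.train.scaled,"scaled:scale")
toy.val.scaled <- scale(toy.val,center=mu,scale=sigma)

# de-scaling: multiply every column by the training sd, then add the training mean
toy.val.back <- sweep(sweep(toy.val.scaled,2,sigma,"*"),2,mu,"+")

# largest difference between the original and the de-scaled values (should be practically 0)
max(abs(toy.val.back-as.matrix(toy.val)))
```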

<font size=4>Save the results so that you can compare them with other results of this exercise and/or with the results of the previous notebook. One interesting option is to run the training with the same parameter combination in both notebooks and compare the results. Are the results of the bonus exercise indeed better than the others? Bear in mind that randomness might also influence the results.</font>
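
<font size=4>One possible way of saving the results is to write them to a CSV file named after <code>out.name</code>. In the sketch below the <code>results</code> table is filled with made-up numbers, its column layout (one column per forest attribute) is an assumption, and the file is written to a temporary folder. In the notebook you would use the real <code>results</code> object and a folder of your choice.</font>

```R
# hypothetical results table (made-up numbers, assumed layout)
results <- rbind(c(20.1,8.2,10.3),c(24.5,9.9,12.0),c(25.2,10.4,12.6))
row.names(results) <- c("training RMSE %","validation RMSE %","test RMSE %")
colnames(results) <- c("v","h","d")
out.name <- "adam_swish_bs25_6.2.2"

# one file per parameter combination, written to a temporary folder in this sketch
res.file <- file.path(tempdir(),paste0("results.",out.name,".csv"))
write.csv(results,res.file)

# reading the file back, e.g. when comparing several runs later on
saved <- read.csv(res.file,row.names=1,check.names=FALSE)
saved
```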

<font size=4>Let's clean up once the training is done and the results and the model are saved. Without doing so, errors might occur during subsequent trainings.</font>

In [None]:
# cell 5.

# let's clean up after training; if this is not done errors will occur
# when you try to start building and training a new model
rm(tf.model,best.model,best.model.name)
k_clear_session()

<font size=5><b>Exercise</b></font><br><br>

<font size=4>Just like in the previous exercise you can try different parameter combinations (don't forget to change ```out.name``` in cell 2.!). Compare the results of this exercise with each other and also with the results of the previous notebook.</font><br>
<font size=4>Try out different options for some parameters (optimization and activation function, number of neurons in the hidden layer, batch size, weights for forest attributes, number of epochs). Best practice is to change one parameter at a time, compare the results with the previous ones and so on.<br><b>In order to enable line numbering in Code cells go to View > Toggle Line Numbers.</b></font><br>

* <font size=4>Go back to cell 1./lines 87-106 and try other options for number of neurons in the hidden layer</font><br><br>
* <font size=4>Go back to cell 1./lines 125-126 and try ReLU activation (you can also try "elu" instead of "relu")</font>
```R
layer_dense(units=n.neur,activation="relu") %>%
# layer_dense(units=n.neur,activation=swish_activation) %>%
```
* <font size=4>Go back to cell 1./lines 140-141 and try another set of weights for forest attributes</font>
```R
# can be experimented with (e.g. c(0.7,0.15,0.15))
preds.w <- c(0.7,0.15,0.15)
```
* <font size=4>Go back to cell 1./lines 145-146 and try some other optimizer function</font>
```R
# opt <- "adam"
opt <- "rmsprop"
```
* <font size=4>Go back to cell 1./lines 161-163 and try some other options for <b>batch size</b> (shouldn't be set to more than 1044, which is the number of samples in the training data), <b>number of epochs</b> (the larger the number, the longer the training will run), <b>patience</b> (you can also try to set it to the same value as the number of epochs; this way no early stopping is done)</font>
```R
# parameters for fit function
batch_size=25
patience=20
epochs=200
```

<font size=4><b>Remember to change the file name for your new model in cell 2./line 5, otherwise the already existing file will be replaced!</b></font>
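
<font size=4>To make it harder to forget this step, the file name can be built from the parameters themselves. The helper below is hypothetical (it is not part of keras or of the course functions) and follows the naming pattern of cell 2. Note that this pattern does not encode the number of neurons or epochs, so extend it if you vary those as well.</font>

```R
# hypothetical helper: builds a name such as "adam_swish_bs25_6.2.2" from the parameters
make.out.name <- function(opt,act,batch_size,preds.w) {
  # weights are written without the leading "0." (0.6, 0.2, 0.2 becomes 6.2.2)
  w <- paste0(sub("^0\\.","",as.character(preds.w)),collapse=".")
  paste0(opt,"_",act,"_bs",batch_size,"_",w)
}

out.name <- make.out.name("adam","swish",25,c(0.6,0.2,0.2))
filepath <- paste0("best.model.",out.name,".hdf5")
filepath

# warning instead of silently replacing a model saved earlier
if (file.exists(filepath)) warning("File already exists and will be replaced: ",filepath)
```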