In [1]:
# This script performs a Bayesian Additive Regression Trees analysis of the nanoQSAR data.
#
# Created: 01/02/2018 Wilson Melendez
# Revised: 

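# Use the current working directory as the location of the helper scripts and the Jar file.
# Replace getwd() with an explicit path if they live elsewhere.
working_directory <- getwd()

# Set the working directory (a no-op unless working_directory was changed above).
setwd(working_directory)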

In [2]:
# Load external functions into this script.
source("runJarFile.R")
source("extractNumericColumns.R")
source("extractXcolumns.R")
source("extractYcolumn.R")
source("getRecordswithResults.R")
source("removeColumnsWithAllNAs.R")
source("removeColumnsWithOneRepeatedValue.R")

In [3]:
# Define string with location of Jar File
jarFolder <- working_directory

# Call function that will run Jar file.
runJarFile(jarFolder)

# Read in the CSV file produced by the Jar file.
filename <- file.path(jarFolder, "nanoQSAR.csv")
nanoQSARdata <- read.csv(filename, stringsAsFactors = FALSE)

# Extract numeric columns
numericData <- extractNumericColumns(nanoQSARdata)

# Get the records with results
trainingData <- getRecordswithResults(numericData)

# Extract the X matrix.
XmatrixOrig <- extractXcolumns(trainingData)

# Extract results column.
Ymatrix <- extractYcolumn(trainingData)

# Convert the Y matrix to a numeric vector.
y <- as.numeric(Ymatrix)

# Check whether the X matrix has columns with no values (all NAs), and if so remove those columns.
Xmatrix <- removeColumnsWithAllNAs(XmatrixOrig)

# Check whether the X matrix has columns that repeat a single value throughout, and if so remove those columns.
Xmatrix <- removeColumnsWithOneRepeatedValue(Xmatrix)

# Set JAVA_HOME to the location of the JDK on your system (adjust this path for your installation).
Sys.setenv("JAVA_HOME"="C:\\Program Files\\Java\\jdk1.8.0_152")

# Confirm the value of JAVA_HOME.
Sys.getenv("JAVA_HOME")

# Load the rJava package -- this is needed by the bartMachine.
library(rJava)
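
The helpers removeColumnsWithAllNAs() and removeColumnsWithOneRepeatedValue() are defined in the sourced files and are not shown in this notebook. As a rough illustration of the kind of filtering their comments describe, here is a minimal base-R sketch. The function names drop_all_na_columns and drop_constant_columns and the toy data frame are hypothetical stand-ins, not the project's actual code.

```r
# Hypothetical stand-ins for the project's helpers (assumed behaviour only).

# Keep only the columns that contain at least one non-missing value.
drop_all_na_columns <- function(df) {
  keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

# Keep only the columns whose non-missing values take more than one distinct value.
drop_constant_columns <- function(df) {
  keep <- vapply(
    df,
    function(col) length(unique(col[!is.na(col)])) > 1,
    logical(1)
  )
  df[, keep, drop = FALSE]
}

# Toy data, invented for illustration.
toy <- data.frame(
  a = c(1, 2, 3),     # informative column
  b = c(NA, NA, NA),  # all missing
  c = c(5, 5, 5),     # one repeated value
  d = c(1, NA, 2)     # partly missing but still varies
)

cleaned <- drop_constant_columns(drop_all_na_columns(toy))
print(names(cleaned))
```

One design choice in this sketch: a column holding a single observed value plus NAs is treated as constant and dropped. Whether the real helper does the same is not known from this notebook.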

In [4]:
# Set the maximum Java heap size before loading bartMachine.
# Note that the maximum amount of memory can be set only once per R session, before the Java Virtual Machine
# starts (a limitation of rJava, since only one JVM can be initialized per R session), but the number of
# cores can be respecified at any time.
options(java.parameters = "-Xmx3000m")

# Load the bartMachine package
library(bartMachine)

# Allocate number of cores that will be used by the bartMachine
set_bart_machine_num_cores(4)

# Call the bartMachine
bart_machine <- bartMachine(Xmatrix, y, 
                            num_trees = 200,
                            num_burn_in = 250,
                            num_iterations_after_burn_in = 1000,
                            alpha = 0.95, beta = 2, k = 2, q = 0.9, nu = 3,
                            prob_rule_class = 0.5,
                            mh_prob_steps = c(2.5, 2.5, 4)/9,
                            debug_log = FALSE,
                            run_in_sample = TRUE,
                            s_sq_y = "mse",
                            sig_sq_est = NULL,
                            cov_prior_vec = NULL,
                            use_missing_data = TRUE, 
                            covariates_to_permute = NULL,
                            num_rand_samps_in_library = 10000,
                            use_missing_data_dummies_as_covars = TRUE,
                            replace_missing_data_with_x_j_bar = FALSE,
                            impute_missingness_with_rf_impute = FALSE,
                            impute_missingness_with_x_j_bar_for_lm = TRUE,
                            mem_cache_for_speed = TRUE,
                            serialize = TRUE,
                            seed = NULL,
                            verbose = TRUE)

# Print a summary of the results, which includes the in-sample pseudo-R-squared.
summary(bart_machine)

Loading required package: bartMachineJARs
Loading required package: car
Loading required package: carData
Loading required package: randomForest
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Loading required package: missForest
Loading required package: foreach
Loading required package: itertools
Loading required package: iterators
Welcome to bartMachine v1.2.3! You have 2.8GB memory available.

If you run out of memory, restart R, and use e.g.
'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
'library(bartMachine)'.



bartMachine now using 4 cores.
bartMachine initializing with 200 trees...
bartMachine vars checked...
bartMachine java init...
bartMachine factors created...
bartMachine before preprocess...
bartMachine after preprocess... 55 total features...
bartMachine sigsq estimated...
bartMachine training data finalized...
Now building bartMachine for regression ...Missing data feature ON. Missingness used as covariates. 
evaluating in sample data...done
serializing in order to be saved for future R sessions...done
bartMachine v1.2.3 for regression

Missing data feature ON
training data n = 250 and p = 54 
built in 11.7 secs on 4 cores, 200 trees, 250 burn-in and 1000 post. samples

sigsq est for y beforehand: 340.801 
avg sigsq estimate after burn-in: 191.19108 

in-sample statistics:
 L1 = 2200.69 
 L2 = 42221.04 
 rmse = 13 
 Pseudo-Rsq = 0.785
p-val for shapiro-wilk test of normality of residuals: 0 
p-val for zero-mean noise: 0.87368 
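
The in-sample statistics in the summary can be related to each other with a few lines of base R. The sketch below assumes the usual definitions (L1 as the sum of absolute residuals, L2 as the sum of squared residuals, rmse as sqrt(L2 / n), and Pseudo-Rsq as 1 - L2 / total sum of squares), which appear consistent with the printed numbers. The function name in_sample_stats and the toy vectors are hypothetical.

```r
# Assumed definitions of the in-sample statistics (a sketch, not bartMachine's own code).
in_sample_stats <- function(y, y_hat) {
  residuals <- y - y_hat
  l2 <- sum(residuals^2)
  list(
    L1 = sum(abs(residuals)),
    L2 = l2,
    rmse = sqrt(l2 / length(y)),
    PseudoRsq = 1 - l2 / sum((y - mean(y))^2)
  )
}

# Toy vectors, invented for illustration only.
y_toy <- c(10, 20, 30, 40)
y_hat_toy <- c(12, 18, 33, 39)
stats_toy <- in_sample_stats(y_toy, y_hat_toy)
str(stats_toy)

# Consistency check against the summary above: L2 = 42221.04 with n = 250.
rmse_check <- sqrt(42221.04 / 250)
print(round(rmse_check, 2))
```

The last line reproduces the reported rmse of 13 from the reported L2 and n, which supports the assumed relationship between the two.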



In [5]:
# Make predictions on the training data. Note: this is not strictly necessary here because, with
# run_in_sample = TRUE, the bart_machine object already stores the in-sample fitted values
# (bart_machine$y_hat_train). It is included as an example of how to use the predict() function.
y_hat <- predict(bart_machine, Xmatrix)

# Perform 5-fold cross-validation using the same hyperparameter values as the model above.
bart_machine_cv5fold <- k_fold_cv(Xmatrix, y, 
                                  k_folds = 5,
                                  folds_vec = NULL, 
                                  verbose = FALSE, 
                                  num_trees = 200,
                                  num_burn_in = 250,
                                  num_iterations_after_burn_in = 1000,
                                  alpha = 0.95, beta = 2, k = 2, q = 0.9, nu = 3,
                                  prob_rule_class = 0.5,
                                  mh_prob_steps = c(2.5, 2.5, 4)/9,
                                  use_missing_data = TRUE, 
                                  use_missing_data_dummies_as_covars = TRUE,
                                  serialize = TRUE)

# Print the out-of-sample pseudo-R-squared and RMSE values.
print(bart_machine_cv5fold$PseudoRsq)
print(bart_machine_cv5fold$rmse)

.....
[1] 0.6109627
[1] 17.48078
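
The cross-validated pseudo-R-squared (0.611) and RMSE (17.48) are noticeably worse than the in-sample values (0.785 and 13), which is expected because each prediction comes from a model that never saw that record. The sketch below shows the general k-fold procedure in base R. It uses lm() on synthetic data as a stand-in for the BART fit, so it illustrates the idea only and is not how k_fold_cv is implemented; all variable names are hypothetical.

```r
set.seed(1)  # fixed seed so the fold assignment is reproducible

# Synthetic data, invented for illustration.
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + rnorm(n)

# Assign each record to one of k folds at random, with equal fold sizes.
k_folds <- 5
fold_id <- sample(rep(seq_len(k_folds), length.out = n))

# For each fold, fit on the other folds and predict the held-out records.
y_hat_oos <- numeric(n)
for (fold in seq_len(k_folds)) {
  held_out <- fold_id == fold
  fit <- lm(y ~ x1 + x2, data = toy[!held_out, ])  # stand-in for the BART fit
  y_hat_oos[held_out] <- predict(fit, newdata = toy[held_out, ])
}

# Out-of-sample error statistics pooled over all folds.
l2_oos <- sum((toy$y - y_hat_oos)^2)
rmse_oos <- sqrt(l2_oos / n)
pseudo_rsq_oos <- 1 - l2_oos / sum((toy$y - mean(toy$y))^2)
print(c(rmse = rmse_oos, pseudo_rsq = pseudo_rsq_oos))
```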


In [6]:
# Build a BART-CV model by cross-validating over a grid of hyperparameter choices.
# Warning: this can take a long time to run.
# Winning combination reported by bartMachine ("CV win" log line): k = 2, nu = 3, q = 0.99, m = 50 trees.
bart_machine_CV <- bartMachineCV(Xmatrix, y,
                                 num_tree_cvs = c(50, 200), 
                                 k_cvs = c(2, 3, 5),
                                 nu_q_cvs = list(c(3, 0.9), c(3, 0.99), c(10, 0.75)), 
                                 k_folds = 5, verbose = FALSE,
                                 num_burn_in = 250,
                                 num_iterations_after_burn_in = 1000,
                                 alpha = 0.95, beta = 2,
                                 prob_rule_class = 0.5,
                                 mh_prob_steps = c(2.5, 2.5, 4)/9,
                                 use_missing_data = TRUE, 
                                 use_missing_data_dummies_as_covars = TRUE,
                                 serialize = TRUE)
              
# Print the cross-validation statistics.
print(bart_machine_CV$cv_stats)

# Save model to a file.
saveRDS(bart_machine_CV, file = "bart_machine_CV.rds")

  bartMachine CV try: k: 2 nu, q: 3, 0.9 m: 50 
.....
  bartMachine CV try: k: 2 nu, q: 3, 0.9 m: 200 
.....
  bartMachine CV try: k: 2 nu, q: 3, 0.99 m: 50 
.....
  bartMachine CV try: k: 2 nu, q: 3, 0.99 m: 200 
.....
  bartMachine CV try: k: 2 nu, q: 10, 0.75 m: 50 
.....
  bartMachine CV try: k: 2 nu, q: 10, 0.75 m: 200 
.....
  bartMachine CV try: k: 3 nu, q: 3, 0.9 m: 50 
.....
  bartMachine CV try: k: 3 nu, q: 3, 0.9 m: 200 
.....
  bartMachine CV try: k: 3 nu, q: 3, 0.99 m: 50 
.....
  bartMachine CV try: k: 3 nu, q: 3, 0.99 m: 200 
.....
  bartMachine CV try: k: 3 nu, q: 10, 0.75 m: 50 
.....
  bartMachine CV try: k: 3 nu, q: 10, 0.75 m: 200 
.....
  bartMachine CV try: k: 5 nu, q: 3, 0.9 m: 50 
.....
  bartMachine CV try: k: 5 nu, q: 3, 0.9 m: 200 
.....
  bartMachine CV try: k: 5 nu, q: 3, 0.99 m: 50 
.....
  bartMachine CV try: k: 5 nu, q: 3, 0.99 m: 200 
.....
  bartMachine CV try: k: 5 nu, q: 10, 0.75 m: 50 
.....
  bartMachine CV try: k: 5 nu, q: 10, 0.75 m: 200 
.....
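
The 18 "CV try" lines in the log above correspond to every combination of 2 tree counts, 3 values of k, and 3 (nu, q) pairs. The following base-R sketch enumerates that grid to show why the run is slow: with 5 folds, each combination requires 5 model fits. The variable names grid and total_cv_fits are hypothetical, and the sketch only counts combinations; it does not call bartMachine.

```r
# Candidate values copied from the bartMachineCV call above.
num_tree_cvs <- c(50, 200)
k_cvs <- c(2, 3, 5)
nu_q_cvs <- list(c(3, 0.9), c(3, 0.99), c(10, 0.75))
k_folds <- 5

# expand.grid varies its first argument fastest, which matches the order of the log:
# m changes fastest, then the (nu, q) pair, then k.
grid <- expand.grid(
  m = num_tree_cvs,
  nu_q_index = seq_along(nu_q_cvs),
  k = k_cvs
)
grid$nu <- vapply(nu_q_cvs[grid$nu_q_index], function(pair) pair[1], numeric(1))
grid$q <- vapply(nu_q_cvs[grid$nu_q_index], function(pair) pair[2], numeric(1))
grid$nu_q_index <- NULL

print(grid)

# Number of cross-validation fits implied by the grid.
total_cv_fits <- nrow(grid) * k_folds
print(total_cv_fits)
```

Adding one more candidate value to any of the three lists multiplies the work accordingly, so it may be worth keeping the grid small while experimenting.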
 

In [7]:
# Fit a second bartMachine model with beta reduced from 2 to 1. A smaller beta raises the prior probability
# of splitting at deeper nodes, so the prior favors deeper trees.
bart_machine1 <- bartMachine(Xmatrix, y, 
                            num_trees = 200,
                            num_burn_in = 250,
                            num_iterations_after_burn_in = 1000,
                            alpha = 0.95, beta = 1, k = 2, q = 0.9, nu = 3,
                            prob_rule_class = 0.5,
                            mh_prob_steps = c(2.5, 2.5, 4)/9,
                            debug_log = FALSE,
                            run_in_sample = TRUE,
                            s_sq_y = "mse",
                            sig_sq_est = NULL,
                            cov_prior_vec = NULL,
                            use_missing_data = TRUE, 
                            covariates_to_permute = NULL,
                            num_rand_samps_in_library = 10000,
                            use_missing_data_dummies_as_covars = TRUE,
                            replace_missing_data_with_x_j_bar = FALSE,
                            impute_missingness_with_rf_impute = FALSE,
                            impute_missingness_with_x_j_bar_for_lm = TRUE,
                            mem_cache_for_speed = TRUE,
                            serialize = TRUE,
                            seed = NULL,
                            verbose = TRUE)  

summary(bart_machine1)

bartMachine initializing with 200 trees...
bartMachine vars checked...
bartMachine java init...
bartMachine factors created...
bartMachine before preprocess...
bartMachine after preprocess... 55 total features...
bartMachine sigsq estimated...
bartMachine training data finalized...
Now building bartMachine for regression ...Missing data feature ON. Missingness used as covariates. 
evaluating in sample data...done
serializing in order to be saved for future R sessions...done
bartMachine v1.2.3 for regression

Missing data feature ON
training data n = 250 and p = 54 
built in 7.4 secs on 4 cores, 200 trees, 250 burn-in and 1000 post. samples

sigsq est for y beforehand: 340.801 
avg sigsq estimate after burn-in: 166.52864 

in-sample statistics:
 L1 = 1999.57 
 L2 = 35577.83 
 rmse = 11.93 
 Pseudo-Rsq = 0.8188
p-val for shapiro-wilk test of normality of residuals: 0 
p-val for zero-mean noise: 0.94533 
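
To see why a smaller beta favors deeper trees, recall that in the BART tree prior a node at depth d is non-terminal (that is, it splits) with probability alpha * (1 + d)^(-beta). The sketch below tabulates that probability for the two settings used in this notebook (beta = 2 and beta = 1, both with alpha = 0.95). The function name split_prob is hypothetical.

```r
# Prior probability that a node at the given depth splits, under the BART tree prior.
split_prob <- function(depth, alpha = 0.95, beta = 2) {
  alpha * (1 + depth)^(-beta)
}

depth <- 0:4
prior_table <- data.frame(
  depth = depth,
  beta_2 = split_prob(depth, beta = 2),  # setting used for bart_machine
  beta_1 = split_prob(depth, beta = 1)   # setting used for bart_machine1
)
print(round(prior_table, 4))
```

At depth 1 the split probability doubles from 0.2375 to 0.475. Note that the improvement seen in the summary above is in-sample only; a cross-validation run would be needed to judge whether the deeper trees generalize better.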



In [8]:
# Save the bartMachine objects to files. The saveRDS() function saves one object at a time.
# Use readRDS() to load each object back into R.
saveRDS(bart_machine, file = "bart_machine.rds")
saveRDS(bart_machine_cv5fold, file = "bart_machine_cv5fold.rds")
saveRDS(bart_machine1, file = "bart_machine_DeeperTrees.rds")
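
The save and restore round trip can be checked with a small stand-in object. The sketch below uses a plain list and a temporary file, both hypothetical, rather than a real bartMachine model. For an actual model it is assumed that the model was built with serialize = TRUE (as above) and that bartMachine is loaded, with its Java options set, before readRDS() is called in a new session.

```r
# Stand-in object; a real bartMachine model would be saved with the same calls.
toy_model <- list(name = "toy", coefficients = c(a = 1.5, b = -0.5))

# Write to a temporary file so the example leaves no files in the working directory.
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_model, file = rds_path)

# Read the object back and confirm it is unchanged.
restored <- readRDS(rds_path)
print(identical(restored, toy_model))
```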