gbm() binary classification of factors seriously broken #6

mzoll · 2017-05-03T13:52:37Z

Hej, I am fairly new to R and wanted to try out your gbm implementation. First, I tried running and understanding the example: done and done. After that I wanted to apply it to my own data, and the algorithm blew up in my face, died a horrible death, and crashed the entire R session.
It took me some trial and error to find out why, which eventually led me to more strangeness in the input consistency required by the call to gbm:

Issue 1:
Classification with 'bernoulli' and cross-validation requires the target column to be either numeric in {0,1} or logical. Using a factor, even one built from either of those sets, results in an error or a crashed session. I suspect this has to do with the cross-validation subroutine and the inputs it expects.

target column type:
logical > OK;
numeric in {0,1} > OK
factor of logical > #Error in checkForRemoteErrors(val) : #3 nodes produced errors; first error: Bernoulli requires the response to be in {0,1}
factor of numeric in {0,1} > horrible horrible death and crashed session
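
As a workaround for the cases listed above, the response can be coerced to numeric 0/1 before it reaches gbm. The following is a minimal sketch in base R; `to_binary01` is a hypothetical helper name, not a function from the gbm package, and it assumes the second factor level is the positive class.

```r
# Hypothetical helper (not part of gbm): coerce a two-class response to numeric 0/1.
to_binary01 <- function(y) {
  if (is.logical(y)) {
    return(as.numeric(y))
  }
  if (is.factor(y)) {
    if (nlevels(y) != 2L) stop("response must have exactly two levels")
    # as.integer() on a factor returns the level codes 1 and 2, so subtract 1.
    return(as.numeric(as.integer(y) - 1L))
  }
  if (is.numeric(y) && all(is.na(y) | y %in% c(0, 1))) {
    return(as.numeric(y))
  }
  stop("response must be logical, a two-level factor, or numeric in {0,1}")
}

# Usage: both factor variants from the list above map to plain 0/1 values.
print(to_binary01(factor(c(TRUE, FALSE, TRUE))))
print(to_binary01(factor(c(0, 1, 1))))
```

With this in place, something like `data$Z <- to_binary01(data$Z)` before the gbm call should keep the response in a format the list above marks as OK.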

Examples at the end of file

Issue 2:
The evaluation of the 'formula' argument seems to be too greedy in recognizing mathematical operators that may be part of a variable name (this could be an issue in base R's formula handling):

If the data frame contains a variable whose name includes a letter combination matching a mathematical function, it is interpreted as that function rather than as part of the variable name:

Example: the variable 'expectation' or 'logical' in 'formula' seems to be interpreted as "'exp' 'ectation'" and "'log' of 'ical'". At least the errors the algorithm throws back at me suggest this.
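
One way to narrow this down is to check how base R itself reads such a formula. The sketch below uses only `all.vars()` from base R and the variable names from this report; it shows how the parser sees the names and does not by itself explain the gbm error.

```r
# Base R keeps each name in a formula as one symbol; nothing below is gbm-specific.
f <- expectation ~ X1 + X2 + X3

# all.vars() lists the variable names the formula refers to.
vars <- all.vars(f)
print(vars)

# A real function call looks different: here log() is applied to a variable named ical.
g <- log(ical) ~ X1
print(all.vars(g))
```

If `expectation` comes back as a single name, the base formula parser is probably not splitting it, and the error is more likely raised somewhere after the formula has been parsed.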

====
`
library (gbm)

N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])

mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1^1.5 + 2 * (X2^0.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values

X1[sample(1:N,size=N/4.)] <- NA

Z <- Y<0.5 | X3 %in% c('b','d')

data <- data.frame(Z = as.logical(Z),
X1 = X1,
X2 = X2,
X3 = X3)

gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#OK

data$Z <- factor(as.logical(Z))
gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#Error in checkForRemoteErrors(val) :
#3 nodes produced errors; first error: Bernoulli requires the response to be in {0,1}

data$Z <- as.numeric(Z)
gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#OK

data$Z <- factor(as.numeric(Z))

gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#HORRIBLE HORRIBLE DEATH

data$expectation <- as.numeric(Z)
gbm1 <-
gbm(expectation~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#ERROR

# Error in object$var.levels[[i]] : subscript out of bounds

`

bgreenwell · 2017-06-21T18:41:30Z

@mzoll sorry for the delayed response. I think you're right about the CV and R crashing (the response format is not checked before being passed to some of the CV routines), I'll try to locate and patch the issue soon. FYI, there are a few typos in your example code, so I was not able to run it exactly, but I did reproduce the error. Also, I'm not sure what you mean regarding issue 2, could you kindly clarify, and maybe provide a reproducible example?
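
Until the response format is checked inside the package, a check can be run before calling gbm so that a bad response fails with an ordinary R error instead of reaching the CV routines. This is only a sketch under stated assumptions: `check_bernoulli_response` is a hypothetical name, it relies on base R's `model.frame()` and `model.response()`, and it is not the patch referred to above.

```r
# Hypothetical pre-flight check; not part of the gbm package.
check_bernoulli_response <- function(formula, data) {
  # Build the model frame without dropping rows that contain NA.
  mf <- stats::model.frame(formula, data = data, na.action = stats::na.pass)
  y <- stats::model.response(mf)
  if (is.factor(y) || is.character(y)) {
    stop("response is a ", class(y)[1],
         "; convert it to numeric 0/1 or logical first")
  }
  # Logical TRUE/FALSE match 1/0 here, so logical responses pass as well.
  if (!all(is.na(y) | y %in% c(0, 1))) {
    stop("response must be in {0,1}")
  }
  invisible(TRUE)
}

# Usage on a toy data frame (made-up values, for illustration only).
toy <- data.frame(Z = c(0, 1, 1, 0), X1 = c(0.1, 0.4, 0.3, 0.9))
ok <- check_bernoulli_response(Z ~ X1, toy)

toy$Z <- factor(toy$Z)
bad <- try(check_bernoulli_response(Z ~ X1, toy), silent = TRUE)
print(inherits(bad, "try-error"))
```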

Thanks for reporting the issue!

richie312 · 2018-05-02T04:28:40Z

Very true. I was writing a generic user-defined function for bagging, randomForest and gbm with the following code:

##Loading packages

pacman::p_load(shiny,shinydashboard,gbm, randomForest,ggplot2,ipred,caret,ROCR,dplyr,ModelMetrics)

## User-defined function

 model = function(algo =gbm ,distribution = 'bernoulli', 
             type = 'response', set ='AUC',n.trees =10000){

 ## Fit the model
  
  model<- algo(formula = default ~ ., 
               distribution = distribution,
               data = train,
               n.trees = n.trees,
               cv.folds = 3)
  
  ## Generate the prediction on the test set
  
  pred<- predict(object = model,
                 newdata = test,
                 n.trees = n.trees,
                 type = type)
  
  ## Generate the test set AUCs using the pred
  
  AUC<- auc(actual = test$default, predicted = pred)
  

if (set == 'AUC'){
  return(AUC)
}

if (set == 'predictions'){
  return(pred)
}
if (set == 'model'){
  return(model)
  
}
else
  return(NULL)

 }

## Now call the different models

## List of different models

get_model <- function(algo, type = 'response', ntrees = 10000) {
  # Return the fitted model; pass ntrees through instead of silently ignoring it
  model(algo = algo, type = type, set = 'model', n.trees = ntrees)
}

  Bag_model<- get_model(algo = bagging, type='prob')
  RF_model<- get_model(algo = randomForest)
 GBM_model<- get_model(algo = gbm)

richie312 · 2018-05-02T04:35:45Z

This is for the German credit dataset. The response variable is "default". Before applying the model, the
response variable is treated as,

        ## load the dataset
        data_x = read.csv("credit.csv")

       ## Preprocessing the dataset

       data_x$default <- ifelse(data_x$default == "yes", 1, 0)

The problem is with gbm. The R session will crash if response variable is not converted to factor.

If response variable is converted to factor then RandomForest will not produce the OOB error rate and confusion matrix in its output component.

Please advise.
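
One possible way around the conflicting requirements is to keep two copies of the response, one per algorithm. randomForest runs classification when the response is a factor and regression when it is numeric, while gbm with 'bernoulli' expects numeric 0/1 as described at the top of this issue. The sketch below uses a small made-up data frame in place of credit.csv; the `amount` column and all values are assumptions for illustration.

```r
# Toy stand-in for the credit data; the real file is not reproduced here.
credit <- data.frame(default = c("yes", "no", "no", "yes"),
                     amount  = c(1200, 300, 450, 2200),
                     stringsAsFactors = FALSE)

# gbm with distribution = "bernoulli" expects numeric 0/1 (or logical).
credit_gbm <- transform(credit, default = as.numeric(default == "yes"))

# randomForest treats a factor response as a classification problem.
credit_rf <- transform(credit, default = factor(default, levels = c("no", "yes")))

str(credit_gbm$default)
str(credit_rf$default)
```

Each model is then fitted on its own copy, so neither conversion interferes with the other.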

bgreenwell · 2020-06-24T16:33:07Z

Should be fixed in the next patch (currently in dev).

bgreenwell closed this as completed Jun 24, 2020
