gbm() binary classification of factors seriously broken #6

Closed
mzoll opened this issue May 3, 2017 · 4 comments

@mzoll

mzoll commented May 3, 2017

Hej, I am fairly new to R and wanted to try out your gbm implementation. First, I ran the example and worked through it: done and done. After that I wanted to apply it to my own data, and the algorithm blew up in my face, died a horrible death and crashed the entire R session.
It took me some trial and error to find out why, which eventually led to more strangeness in the input consistency that the call to gbm requires:

Issue 1:
Classification with 'bernoulli' and cross-validation requires the target column to be either numeric in {0,1} or logical. Using a factor, even one built from either of those, results in an error or a crashed session. I suspect this has to do with the cross-validation subroutine and the inputs it expects.

target column type:
logical > OK;
numeric in {0,1} > OK
factor of logical > #Error in checkForRemoteErrors(val) : #3 nodes produced errors; first error: Bernoulli requires the response to be in {0,1}
factor of numeric in {0,1} > horrible horrible death and crashed session

Examples at the end of file
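
Until the input check is fixed, one possible workaround is to convert the response to numeric 0/1 before calling gbm. The sketch below uses a hypothetical helper, `to_binary01`, which is not part of gbm. Note that a plain `as.numeric()` on a factor returns the level codes 1 and 2 rather than 0 and 1, so the helper subtracts 1 from the codes.

```r
# Hypothetical helper (not part of gbm): map a two-level response to numeric 0/1.
to_binary01 <- function(y) {
  if (is.logical(y)) {
    return(as.numeric(y))
  }
  if (is.factor(y)) {
    stopifnot(nlevels(y) == 2)
    # as.integer() on a factor yields the level codes 1 and 2, so subtract 1:
    # the first level becomes 0 and the second level becomes 1.
    return(as.numeric(as.integer(y) - 1L))
  }
  y <- as.numeric(y)
  # Missing values are allowed here; everything else must be 0 or 1.
  stopifnot(all(y[!is.na(y)] %in% c(0, 1)))
  y
}

z_factor <- factor(c(TRUE, FALSE, TRUE))
z_numeric <- to_binary01(z_factor)
print(z_numeric)
```

With a data frame like the one in the examples, this would be applied as `data$Z <- to_binary01(data$Z)` before the call to gbm.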

Issue 2:
The evaluation of the 'formula' parameter seems to be too greedy in matching mathematical operators that may be part of a variable name (this may be an issue with base R formulas):

If the data frame has a variable to be analyzed whose name contains a letter combination matching the keyword of a mathematical function, it is interpreted as that function rather than as part of the variable name:

Example: a variable 'expectation' or 'logical' in 'formula' is interpreted as "'exp' 'ectation'" and "'log' of 'ical'". At least, the errors the algorithm throws back at me suggest this.
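
A quick way to check whether the base formula parser itself splits such names is `all.vars()`, which lists the variables found in a formula. This is only a sketch of the check in base R, not a look into gbm internals. If the full name comes back intact, the error more likely originates somewhere else, though that remains an assumption.

```r
# all.vars() reports the variable names the formula parser found.
f <- expectation ~ X1 + X2 + X3
vars_found <- all.vars(f)
print(vars_found)
# "expectation" "X1" "X2" "X3"

# For comparison, an explicit call to log() is parsed as a function call,
# and only its argument is reported as a variable.
vars_log <- all.vars(log(ical) ~ X1)
print(vars_log)
# "ical" "X1"
```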

====
`
library (gbm)

N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])

mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values

X1[sample(1:N,size=N/4.)] <- NA

Z <- Y<0.5 | X3 %in% c('b','d')

data <- data.frame(Z = as.logical(Z),
X1 = X1,
X2 = X2,
X3 = X3)

gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#OK

data$Z <- factor(as.logical(Z))
gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#Error in checkForRemoteErrors(val) :
#3 nodes produced errors; first error: Bernoulli requires the response to be in {0,1}

data$Z <- as.numeric(Z)
gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#OK

data$Z <- factor(as.numeric(Z))

gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#HORRIBLE HORRIBLE DEATH

data$expectation <- as.numeric(Z)
gbm1 <-
gbm(expectation~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#ERROR
#Error in object$var.levels[[i]] : subscript out of bounds

`

@bgreenwell
Contributor

@mzoll sorry for the delayed response. I think you're right about the CV and R crashing (the response format is not checked before being passed to some of the CV routines), I'll try to locate and patch the issue soon. FYI, there are a few typos in your example code, so I was not able to run it exactly, but I did reproduce the error. Also, I'm not sure what you mean regarding issue 2, could you kindly clarify, and maybe provide a reproducible example?

Thanks for reporting the issue!
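
For illustration, a check of the response format before anything reaches the CV routines might look roughly like the sketch below. The function name `check_bernoulli_response` is hypothetical and this is not the actual patch. It only shows the idea of failing early with a readable error instead of crashing.

```r
# Hypothetical guard (not the actual gbm patch): stop early when the
# response is not usable for distribution = "bernoulli".
check_bernoulli_response <- function(y) {
  if (is.factor(y) || is.character(y)) {
    stop("Bernoulli response must be numeric 0/1 or logical, not ", class(y)[1])
  }
  # Ignore missing values; every remaining value must be 0 or 1
  # (logical TRUE/FALSE compare equal to 1/0).
  vals <- unique(y[!is.na(y)])
  if (!all(vals %in% c(0, 1))) {
    stop("Bernoulli requires the response to be in {0,1}")
  }
  invisible(TRUE)
}

# A numeric 0/1 response passes; a factor is rejected with an error.
ok <- check_bernoulli_response(c(0, 1, 1, NA))
bad <- try(check_bernoulli_response(factor(c(0, 1))), silent = TRUE)
```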

@richie312

richie312 commented May 2, 2018

Very true. I was making a generic user defined function for bagging, randomForest and gbm by the following code,

##Loading packages

pacman::p_load(shiny,shinydashboard,gbm, randomForest,ggplot2,ipred,caret,ROCR,dplyr,ModelMetrics)

user defined function

 model = function(algo =gbm ,distribution = 'bernoulli', 
             type = 'response', set ='AUC',n.trees =10000){

 ## Fit the model
  
  model<- algo(formula = default ~ ., 
               distribution = distribution,
               data = train,
               n.trees = n.trees,
               cv.fold= 3)
  
  ## Generate the prediction on the test set
  
  pred<- predict(object = model,
                 newdata = test,
                 n.trees = n.trees,
                 type = type)
  
  ## Generate the test set AUCs using the pred
  
  AUC<- auc(actual = test$default, predicted = pred)
  

if (set == 'AUC'){
  return(AUC)
}

if (set == 'predictions'){
  return(pred)
}
if (set == 'model'){
  return(model)
  
}
else
  return(NULL)

 }

now call different model

List of different models

get_model<- function(algo,type = 'response', ntrees = 10000){
  ## return the fitted model object, passing the tree count through
  model(algo = algo, type = type, set = 'model', n.trees = ntrees)
}

Bag_model<- get_model(algo = bagging, type='prob')
RF_model<- get_model(algo = randomForest)
GBM_model<- get_model(algo = gbm)

@richie312

This is for the German credit dataset. The response variable is "default". Before applying the model, the response variable is treated as follows:

        ## load the dataset
        data_x = read.csv("credit.csv")

       ## Preprocessing the dataset

       data_x$default <- ifelse(data_x$default == "yes", 1, 0)

The problem is with gbm. The R session will crash if the response variable is not converted to a factor.

If the response variable is converted to a factor, then randomForest will not produce the OOB error rate and confusion matrix in its output.

Please advise.
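
The first comment in this thread reports a numeric 0/1 response working with gbm, and the randomForest documentation states that a factor response is treated as classification. Assuming both hold here, one option is to keep two copies of the response and hand each algorithm the type it expects. The small data frame below is a made-up stand-in for the credit data, used only to show the conversion.

```r
# Hypothetical stand-in for the credit data; only the response column matters here.
data_x <- data.frame(default = c("yes", "no", "no", "yes"))

# Numeric 0/1 copy, intended for gbm(distribution = "bernoulli").
default_num <- ifelse(data_x$default == "yes", 1, 0)

# Factor copy, intended for randomForest(), which assumes classification
# when the response is a factor and regression otherwise.
default_fac <- factor(data_x$default, levels = c("no", "yes"))

str(default_num)
str(default_fac)
```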

@bgreenwell
Contributor

Should be fixed in the next patch (currently in dev).
