gbm() binary classification of factors seriously broken #6
Comments
@mzoll sorry for the delayed response. I think you're right about the CV and R crashing (the response format is not checked before being passed to some of the CV routines); I'll try to locate and patch the issue soon. FYI, there are a few typos in your example code, so I was not able to run it exactly, but I did reproduce the error. Also, I'm not sure what you mean regarding issue 2. Could you kindly clarify, and maybe provide a reproducible example? Thanks for reporting the issue!
Very true. I was writing a generic user-defined function for bagging, randomForest and gbm with the following code:
## Loading packages
pacman::p_load(shiny, shinydashboard, gbm, randomForest, ggplot2, ipred, caret, ROCR, dplyr, ModelMetrics)
## user defined function
## now call different modelList of different models
This is for the German credit dataset. The response variable is "default". Before applying the model, the
The problem is with gbm. The R session will crash if the response variable is not converted to a factor. If the response variable is converted to a factor, then randomForest will not produce the OOB error rate and confusion matrix in its output component. Please advise.
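One way to avoid choosing a single representation is to keep both side by side and hand each model whichever one it accepts. The following is only a sketch: the data frame `credit` and its values are made up as a stand-in for the German credit data, and it assumes `default` is coded 0/1.

```r
# Toy stand-in for the German credit data (made-up values, illustration only).
credit <- data.frame(
  default = c(0, 1, 0, 1, 1, 0),
  amount  = c(1200, 5400, 800, 7600, 3100, 950)
)

# Copy with a numeric 0/1 response.
credit_num <- credit
credit_num$default <- as.numeric(credit$default)

# Copy with a two-level factor response; the levels are fixed explicitly
# so that "0" is always the first level.
credit_fac <- credit
credit_fac$default <- factor(credit$default, levels = c(0, 1))

str(credit_num$default)
str(credit_fac$default)
```

Using two separate data frames, rather than two response columns in one, keeps a formula such as `default ~ .` from picking up the other copy of the response as a predictor.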
Should be fixed in the next patch (currently in dev).
Hej, I am fairly new to R and wanted to try out your gbm implementation. First, I tried running and understanding the example: done and done. After that I wanted to apply it to my own data, and the algorithm blew up in my face, died a horrible death and crashed the entire R session.
It took me some trial and error to find out why, which eventually led me to more strangeness in the input consistency required by the call to gbm:
Issue 1:
Classification with 'bernoulli' and cross-validation requires the target column to be either numeric in {0,1} or logical. Using a factor, even one built from either of those sets, results in an error or a crashed session. I suspect this has to do with the cross-validation subroutine and the inputs it expects.
target column type:
logical > OK
numeric in {0,1} > OK
factor of logical > Error in checkForRemoteErrors(val) : 3 nodes produced errors; first error: Bernoulli requires the response to be in {0,1}
factor of numeric in {0,1} > horrible horrible death and crashed session
Examples at the end of file
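Until the response is validated inside gbm, a factor target can be normalized by hand before the call. One pitfall in base R is that `as.numeric()` on a factor returns the internal level codes (1, 2) rather than the original 0/1 values. The helper `to_binary_response()` below is hypothetical (it is not part of gbm) and is only a minimal sketch of such a conversion.

```r
z <- c(0, 1, 1, 0)
as.numeric(factor(z))                 # level codes: 1 2 2 1
as.numeric(as.character(factor(z)))   # original values: 0 1 1 0

# Hypothetical helper (not part of gbm): coerce a two-class response
# to numeric 0/1, or stop with a readable message.
to_binary_response <- function(y) {
  if (is.logical(y)) {
    return(as.numeric(y))
  }
  if (is.factor(y)) {
    if (nlevels(y) != 2L) {
      stop("factor response must have exactly two levels")
    }
    # Level codes are 1 and 2, so subtracting 1 maps the first level to 0.
    return(as.numeric(as.integer(y) - 1L))
  }
  if (is.numeric(y) && all(y[!is.na(y)] %in% c(0, 1))) {
    return(as.numeric(y))
  }
  stop("response must be logical, a two-level factor, or numeric in {0,1}")
}

to_binary_response(factor(c(TRUE, FALSE, TRUE)))
```

With this, `data$Z <- to_binary_response(data$Z)` before calling `gbm()` would put all four target column types listed above into the numeric {0,1} form that is reported as OK.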
Issue 2:
The evaluation of the 'formula' parameter seems to be too greedy in matching mathematical function names that happen to be part of a variable name (this may be an issue in base R's formula handling):
If the data frame contains a variable whose name includes a letter combination matching a mathematical function keyword, that part is interpreted as the function rather than as part of the variable name.
Example: a variable named 'expectation' or 'logical' in 'formula' is interpreted as "'exp' of 'ectation'" or "'log' of 'ical'". At least, the errors the algorithm throws back at me suggest this.
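A quick way to check whether base R's formula parsing is involved, using only base R: `all.vars()` lists the variable names a formula refers to, leaving out function names such as `exp` and `log`. The formulas below are illustrative and are not evaluated against any data.

```r
# Variable names that merely contain "exp" or "log".
vars_plain <- all.vars(expectation ~ X1 + X2 + X3)
print(vars_plain)

# A formula with real calls to exp() and log(), for comparison.
vars_calls <- all.vars(logical ~ exp(X1) + log(X2))
print(vars_calls)
```

If the names come back whole, the formula parser itself is probably not splitting them, and the error may arise later on; that is an inference from this check, not something confirmed in this thread.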
====
`
library(gbm)
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
X1[sample(1:N,size=N/4.)] <- NA
Z <- Y<0.5 | X3 %in% c('b','d')
data <- data.frame(Z = as.logical(Z),
X1 = X1,
X2 = X2,
X3 = X3)
gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#OK
data$Z <- factor(as.logical(Z))
gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#Error in checkForRemoteErrors(val) :
#3 nodes produced errors; first error: Bernoulli requires the response to be in {0,1}
data$Z <- as.numeric(Z)
gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#OK
data$Z <- factor(as.numeric(Z))
gbm1 <-
gbm(Z~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#HORRIBLE HORRIBLE DEATH
data$expectation <- as.numeric(Z)
gbm1 <-
gbm(expectation~X1+X2+X3,
data=data,
distribution='bernoulli',
n.trees=200,
cv.folds = 3,
keep.data=TRUE)
#ERROR
#Error in object$var.levels[[i]] : subscript out of bounds
`