Prediction for logistic regression is broken in R package #1903

Closed
osofr opened this Issue Dec 23, 2016 · 4 comments

@osofr
osofr commented Dec 23, 2016

The following code breaks both CRAN and drat repo installs of the R package on OS X.

require('xgboost')
data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
param <- list(objective = "reg:logistic",
              booster = "gblinear",
              lambda = 0L, alpha = 0L)
logist_reg <- xgb.train(param, dtrain, watchlist=list(train = dtrain),
                        metrics = list("rmse"),
                        early_stopping_rounds = 3, nrounds = 50, verbose = 2)
preds <- predict(logist_reg, dtrain)

The error produced when calling predict:

Error in predict.xgb.Booster(logist_reg, dtrain) :
[20:12:15] amalgamation/../src/gbm/gblinear.cc:179: Check failed: (ntree_limit) == (0) GBLinear::Predict ntrees is only valid for gbtree predictor

Full output with session info:

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[R.app GUI 1.68 (7238) x86_64-apple-darwin13.4.0]

require('xgboost')
Loading required package: xgboost
Warning message:
package ‘xgboost’ was built under R version 3.3.2
data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
param <- list(objective = "reg:logistic",
              booster = "gblinear",
              lambda = 0L, alpha = 0L)
logist_reg <- xgb.train(param, dtrain, watchlist=list(train = dtrain),
                        metrics = list("rmse"),
                        early_stopping_rounds = 3, nrounds = 50, verbose = 2)

[1] train-rmse:0.168630
Will train until train_rmse hasn't improved in 3 rounds.

[2] train-rmse:0.075750
[3] train-rmse:0.041182
[4] train-rmse:0.027356
[5] train-rmse:0.023118
[6] train-rmse:0.019936
[7] train-rmse:0.006382
[8] train-rmse:0.003435
[9] train-rmse:0.001683
[10] train-rmse:0.000794
[11] train-rmse:0.000368
[12] train-rmse:0.000168
[13] train-rmse:0.000076
[14] train-rmse:0.000034
[15] train-rmse:0.000016
[16] train-rmse:0.000007
[17] train-rmse:0.000003
[18] train-rmse:0.000001
[19] train-rmse:0.000001
[20] train-rmse:0.000000
[21] train-rmse:0.000000
[22] train-rmse:0.000000
[23] train-rmse:0.000000
Stopping. Best iteration:
[20] train-rmse:0.000000

preds <- predict(logist_reg, dtrain)
Error in predict.xgb.Booster(logist_reg, dtrain) :
[20:12:15] amalgamation/../src/gbm/gblinear.cc:179: Check failed: (ntree_limit) == (0) GBLinear::Predict ntrees is only valid for gbtree predictor

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] xgboost_0.6-2

loaded via a namespace (and not attached):
[1] magrittr_1.5 Matrix_1.2-7.1 tools_3.3.1 stringi_1.1.2 grid_3.3.1 data.table_1.10.1 lattice_0.20-34

Environment info

Operating System: OS X

@joebling
joebling commented Jan 2, 2017

I just ran into the same problem:

library(xgboost)
dtrain <- xgb.DMatrix(data=as.matrix(train[-1]), label = train$label)
dtest <- xgb.DMatrix(data=as.matrix(test[-1]), label = test$label)

param <- list(objective = "binary:logistic",
              booster = "gblinear",
              nthread = 6,
              alpha = 0.0001,
              lambda = 1,
              eval_metric = 'auc')

watchlist <- list(train = dtrain, valid = dtest)

bst <- xgb.train(param, data = dtrain, nrounds = 300, watchlist, verbose = 1, early_stopping_rounds = 5)
pred <- predict(bst, dtrain)
Error in predict.xgb.Booster(bst, dtrain) :
[21:34:35] amalgamation/../src/gbm/gblinear.cc:177: Check failed: (ntree_limit) == (0) GBLinear::Predict ntrees is only valid for gbtree predictor

@osofr
osofr commented Jan 2, 2017

I think I understand the source of the problem a bit better now. The following temporary hack avoids the error, but it doesn't address the underlying issue. Just set the offending booster attribute to NULL (note that the predictions will then be based on ALL iterations, which will typically be more than the best iteration):

bst$best_ntreelimit <- NULL
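
For the reproduction above, the same workaround looks like this (just a sketch of the hack, not a proper fix; predict() will then use all trained iterations rather than the best one):

# assumes logist_reg and dtrain from the original example above
logist_reg$best_ntreelimit <- NULL
preds <- predict(logist_reg, dtrain)  # no longer hits the GBLinear ntree_limit check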

It seems that it would be considered non-standard (or even wrong by some) to use validation data to determine when to stop the coordinate descent for glm. A typical approach would rely only on training data, but we would still need some criterion for early stopping of the glm! As far as I can tell, the current version of glm in xgboost errors out with any kind of early stopping, regardless of which evaluation set is used: training, validation, or cv. That seems like a bug to me.

@khotilov khotilov added a commit to khotilov/xgboost that referenced this issue Jan 4, 2017: [R] fix #1903 (37b50c4)
@khotilov
Contributor
khotilov commented Jan 4, 2017

Thanks for reporting. The #1929 PR should help.

A typical approach is to use 3-way splitting into train/validation/test (or nested CV), where the validation set is used for hyperparameter tuning. Early stopping helps to tune nrounds. Early stopping in xgb.cv within the training data can be used for that purpose as well.
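
A minimal sketch of the xgb.cv route, reusing dtrain from the example above (nfold and the other values here are just illustrative assumptions):

# Tune nrounds via early stopping inside cross-validation (illustrative values).
cv <- xgb.cv(params = list(objective = "reg:logistic", booster = "gblinear"),
             data = dtrain, nrounds = 50, nfold = 5,
             early_stopping_rounds = 3, verbose = 0)
cv$best_iteration  # use this as nrounds when retraining on the full training data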

@osofr
osofr commented Jan 6, 2017

Thanks for the prompt reply, much appreciated.

Just to clarify that I understand this correctly (my apologies, I am still fairly new to xgboost): to perform early stopping on the training data within xgb.cv, all I need to do is engage the cb.early.stop callback and set metric_name to the desired training-data metric, e.g., "train-rmse" (roughly as in the sketch below)? I did not realize this could also be done inside xgb.cv; if that's the case, this is really fantastic!
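
In other words, something roughly like this (my sketch; I'm not sure of the exact metric_name string that cb.early.stop expects, so "train_rmse" below is a guess):

# Sketch: early stopping in xgb.cv driven by the training metric.
# param and dtrain are from the original example; metric_name is my guess.
cv <- xgb.cv(params = param, data = dtrain, nrounds = 50, nfold = 5,
             callbacks = list(cb.early.stop(stopping_rounds = 3,
                                            metric_name = "train_rmse")))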

@tqchen tqchen added a commit that closed this issue Jan 6, 2017: khotilov + tqchen, [R] fix #1903 (#1929) (87e897f)
@tqchen tqchen closed this in 87e897f Jan 6, 2017