base_score default #799

Closed
ronmexico2718 opened this issue Feb 2, 2016 · 12 comments

@ronmexico2718

At least in the R package, it seems that the base_score defaults to 0.5. For a classification problem, this seems to be a reasonable choice. However, this strikes me as a bizarre choice for a regression problem. Would it make more sense to set base_score=mean(label) by default? This choice might even be an improvement over the default in both regression and classification settings.

If the learning rate/shrinkage were set to 1, then the choice of base_score would be irrelevant as the resulting model would just compensate for any changes in the mean. However, essentially all reasonable implementations of boosted trees use learning rate much smaller than 1. Thus, in principle, it would seem that the choice of base_score would affect the final model.

I understand that it is extremely straightforward to set the base_score manually, but if there is a way to pick a better default, it may be worth implementing.

Any thoughts?
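For concreteness, a minimal sketch of the manual workaround in R (`X` and `y` are placeholder names for a feature matrix and label vector, not from any specific dataset):

```r
library(xgboost)

# X: numeric feature matrix, y: numeric label vector (placeholders)
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(objective = "reg:linear",
               eta = 0.01,
               max_depth = 3,
               base_score = mean(y))  # instead of the default 0.5

bst <- xgb.train(params, dtrain, nrounds = 500)
```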

@tqchen
Member

tqchen commented Feb 2, 2016

Normally the choice of base_score won't affect final performance that much as long as enough steps are run. But I agree that allowing setting it to mean(label) could be helpful in some cases.

@ronmexico2718
Author

Agree that the effects should be minimal. I'll run some tests when I have some time.

@khotilov
Member

khotilov commented Feb 3, 2016

For highly imbalanced data, I sometimes restrict max_delta_step to smaller values. And then it is not so rare to see xgboost struggling for several iterations to produce any improvement. But the learning progress is much better if I provide a meaningful initial base_score value. So, having an adaptive initialization for it would definitely be useful.


@luoruisichuan

For highly imbalanced data, why not tune the parameter "scale_pos_weight" for classification problems? Do you know in which cases "scale_pos_weight" should be tuned, and in which cases "max_delta_step" should be tuned?

@gugatr0n1c

for a regression task, base_score = mean(label) reduces the number of epochs by about 25 percent (on my dataset). I always set this manually for regression... so voting for this as well.

@ronmexico2718
Author

@gugatr0n1c I also am noticing substantial differences in the number of trees required, in both classification and regression. The number of trees required is smallest when I set base_score=mean(label). The performance doesn't change much (in terms of CV error). R code is below. Anyone have any objections to setting the default base_score to mean(label)?

# -runs 2 tests for xgboost R package (one for regression, another classification)
# -tests how cv error and optimal number of trees depends on base_score for a simple problem

library(xgboost)
library(MASS)
library(boot)

######################
# REGRESSION EXAMPLE #
######################

#try a few base_scores ranging between +/- base_score_range
base_score_range <- 10
base_score_test_regression <- c(-base_score_range, -base_score_range/2, 0, base_score_range/2, base_score_range) 
xg_pars_regression <- list(eta=0.01, max.depth=3)
n_trees_max_regression <- 1000

#run xgboost cross validation for each choice of base_score
NRuns_regression <- 10 #number of runs for each base score
cv_err_regression <- matrix(0, nrow=NRuns_regression, ncol=length(base_score_test_regression))
n_trees_opt_regression <- matrix(0, nrow=NRuns_regression, ncol=length(base_score_test_regression))
for (i in 1:NRuns_regression) {
  #generate data from simple polynomial function +noise
  n <- 1000
  x <- mvrnorm(n=n, mu=c(0,0,0), Sigma=diag(3))
  y_regression <- 2 * (x[,1]>=0.3) + 2 * (x[,2] < 0) + x[,3] ^ 2 + x[,1] * x[,2] + x[,2] * x[,3] + 10*rnorm(n=nrow(x))
  y_regression <- y_regression - mean(y_regression) #make mean 0
  for (j in 1:length(base_score_test_regression)) {
    print(paste0("Run ", i, " of ", NRuns_regression, ", base_score ", j, " of ", length(base_score_test_regression)))

    #set base score
    xg_pars_regression$base_score <- base_score_test_regression[j]

    #run cross validation
    xg_cv_results_regression <- xgb.cv(xg_pars_regression
                                       , data = xgb.DMatrix(data.matrix(x), label=data.matrix(y_regression), missing=NA)
                                       , nrounds=n_trees_max_regression
                                       , nfold=3
                                       , verbose=FALSE
                                       )

    #keep track of cv error and optimal number of trees
    cv_err_regression[i,j] <- min(xg_cv_results_regression$test.rmse.mean)
    n_trees_opt_regression[i,j] <- which.min(xg_cv_results_regression$test.rmse.mean)
  }
}

cv_reg_df <- as.data.frame(cv_err_regression)
colnames(cv_reg_df) <- as.character(base_score_test_regression)
write.csv(colMeans(cv_reg_df), file="cv_reg_df.csv")

ntrees_reg_df <- as.data.frame(n_trees_opt_regression)
colnames(ntrees_reg_df) <- as.character(base_score_test_regression)
write.csv(colMeans(ntrees_reg_df), file="ntrees_reg_df.csv")

##########################
# CLASSIFICATION EXAMPLE #
##########################
base_score_test_classification <- seq(from=0.01, to=0.99, length.out=5)
xg_pars_classification <- list(eta=0.01, max.depth=3, objective="binary:logistic", eval_metric="logloss")
n_trees_max_classification <- 1000

#run xgboost cross validation for each choice of base_score
NRuns_classification <- 10 #number of runs for each base score
cv_err_classification <- matrix(0, nrow=NRuns_classification, ncol=length(base_score_test_classification))
n_trees_opt_classification <- matrix(0, nrow=NRuns_classification, ncol=length(base_score_test_classification))
for (i in 1:NRuns_classification) {
  #generate data from simple polynomial function +noise
  n <- 1000
  x <- mvrnorm(n=n, mu=c(0,0,0), Sigma=diag(3))
  y_classification_logit <- 2 * (x[,1]>=0.3) + 2 * (x[,2] < 0) + x[,3] ^ 2 + x[,1] * x[,2] + x[,2] * x[,3] + 10*rnorm(n=nrow(x))
  #shift the mean to -20 on the logit scale; this makes P(y=1) decidedly different from 0.5
  y_classification_logit <- y_classification_logit - mean(y_classification_logit) - 20
  y_classification <- round(inv.logit(y_classification_logit))
  for (j in 1:length(base_score_test_classification)) {
    print(paste0("Run ", i, " of ", NRuns_classification, ", base_score ", j, " of ", length(base_score_test_classification)))
    #set base score
    xg_pars_classification$base_score <- base_score_test_classification[j]

    #run cv
    xg_cv_results_classification <- xgb.cv(xg_pars_classification
                                       , data = xgb.DMatrix(data.matrix(x), label=data.matrix(y_classification), missing=NA)
                                       , nrounds=n_trees_max_classification
                                       , nfold=5
                                       , verbose=FALSE
                                       , nthread=1)

    #keep track of cv error and optimal number of trees
    cv_err_classification[i,j] <- min(xg_cv_results_classification$test.logloss.mean)
    n_trees_opt_classification[i,j] <- which.min(xg_cv_results_classification$test.logloss.mean)
  }
}

cv_class_df <- as.data.frame(cv_err_classification)
colnames(cv_class_df) <- as.character(base_score_test_classification)
write.csv(colMeans(cv_class_df), file="cv_class_df.csv")

ntrees_class_df <- as.data.frame(n_trees_opt_classification)
colnames(ntrees_class_df) <- as.character(base_score_test_classification)
write.csv(colMeans(ntrees_class_df), file="ntrees_class_df.csv")

@gugatr0n1c

Yeah, I agree.
It makes sense: when base_score is set to 0.5 and we have a regression problem where np.average(label) is very high and eta is very small (I used eta = 0.0015 and num_tree = 5000), the first hundreds of trees just try to catch this big mean, which takes a lot of time. With base_score set to mean(label) this is solved right away at the beginning, and the next trees can work on the original task.

@pietermarsman

+1

@BlindApe

And with multiclass the issue is more painful, see #1380.
Beginning with a prior estimate of each class probability seems reasonable.

@sebconort

+1
If you think of all the computation time (and electric power) wasted by people not adjusting their base_score manually, this is an important (ecological) issue!

We can think of all the marketing scores, credit scoring, or insurance pricing models where the final prediction distributions are centered around values below 10%.

I think Kaggle could sponsor the fix, given how much money they would save with all those people running public scripts with xgboost and its default params ^^

@antoinecarme

antoinecarme commented Mar 31, 2017

Hi all,

I am new to xgboost and am interested in an automatic choice for the base_score parameter.

The choice I would make (some mentioned it above) is based on class prior probabilities (label frequencies) in the multi-class case, and on the signal mean for regression (equivalent to centering).

All these choices are consistent. A regression with a discrete 0-1 target has the frequency of 1s as its target mean and produces the same model as a classification on the same dataset, which is an interesting invariance of the algorithm.

This would make the parameter unnecessary for the end user (an internal choice).

Thanks

Antoine
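The behaviour Antoine proposes could be sketched as the following helper (hypothetical code, not part of xgboost; note that per-class priors for multiclass would need vector support for base_score, as discussed in #1380):

```r
# Hypothetical sketch of an automatic base_score choice, not current xgboost code.
auto_base_score <- function(y, objective) {
  if (objective == "binary:logistic") {
    mean(y)                           # positive-class frequency
  } else if (objective == "multi:softprob") {
    as.numeric(table(y)) / length(y)  # per-class prior probabilities (#1380)
  } else {
    mean(y)                           # regression: label mean (centering)
  }
}
```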

@tqchen tqchen closed this as completed Jul 4, 2018
@kdebrab

kdebrab commented Sep 7, 2018

It seems to me that adjusting the base_score to mean(label) improves results considerably for regression with 'reg:gamma' as objective, see https://stats.stackexchange.com/a/365803/219262 for an example.
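A sketch of that comparison (assuming training data already wrapped in a `dtrain` DMatrix with strictly positive labels, as a gamma objective requires):

```r
params <- list(objective = "reg:gamma", eta = 0.01, max_depth = 3)

# Default base_score = 0.5
cv_default <- xgb.cv(c(params, base_score = 0.5),
                     dtrain, nrounds = 1000, nfold = 5, verbose = FALSE)

# base_score set to the label mean
cv_mean <- xgb.cv(c(params, base_score = mean(getinfo(dtrain, "label"))),
                  dtrain, nrounds = 1000, nfold = 5, verbose = FALSE)
```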

@lock lock bot locked as resolved and limited conversation to collaborators Dec 6, 2018