base_score default #799

Closed
ronmexico2718 opened this issue Feb 2, 2016 · 12 comments

@ronmexico2718

At least in the R package, it seems that the base_score defaults to 0.5. For a classification problem, this seems to be a reasonable choice. However, this strikes me as a bizarre choice for a regression problem. Would it make more sense to set base_score=mean(label) by default? This choice might even be an improvement over the default in both regression and classification settings.

If the learning rate/shrinkage were set to 1, then the choice of base_score would be irrelevant as the resulting model would just compensate for any changes in the mean. However, essentially all reasonable implementations of boosted trees use learning rate much smaller than 1. Thus, in principle, it would seem that the choice of base_score would affect the final model.

I understand that it is extremely straightforward to set the base_score manually, but if there is a way to pick a better default, it may be worth implementing.

Any thoughts?
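For concreteness, a minimal sketch of the manual workaround in R (`X` and `y` are placeholder names for a feature matrix and label vector, not from any specific dataset):

```r
library(xgboost)

# X: numeric feature matrix, y: numeric label vector (placeholders)
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(objective = "reg:linear",
               eta = 0.01,
               max_depth = 3,
               base_score = mean(y))  # instead of the default 0.5

bst <- xgb.train(params, dtrain, nrounds = 500)
```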

@tqchen
Member

tqchen commented Feb 2, 2016

Normally the choice of base_score won't affect final performance that much as long as enough steps are run. But I agree that allowing setting it to mean(label) could be helpful in some cases.

@ronmexico2718
Author

Agree that the effects should be minimal. I'll run some tests when I have some time.

@khotilov
Member

khotilov commented Feb 3, 2016

For highly imbalanced data, I sometimes restrict max_delta_step to smaller values. And then it is not so rare to see xgboost struggling for several iterations to produce any improvement. But the learning progress is much better if I provide a meaningful initial base_score value. So, having an adaptive initialization for it would definitely be useful.


@luoruisichuan

For highly imbalanced data, why not tune the parameter "scale_pos_weight" for classification problems? Do you know in which cases "scale_pos_weight" should be tuned, and in which cases "max_delta_step" should be tuned?

@gugatr0n1c

for a regression task, base_score = mean(label) reduces the number of epochs by about 25 percent (on my dataset). I always set this manually for regression... so voting for this as well.

@ronmexico2718
Author

@gugatr0n1c I also am noticing substantial differences in the number of trees required, in both classification and regression. The number of trees required is smallest when I set base_score=mean(label). The performance doesn't change much (in terms of CV error). R code is below. Anyone have any objections to setting the default base_score to mean(label)?

# -runs 2 tests for xgboost R package (one for regression, another classification)
# -tests how cv error and optimal number of trees depends on base_score for a simple problem

library(xgboost)
library(MASS)
library(boot)

######################
# REGRESSION EXAMPLE #
######################

#try a few base_scores ranging between +/- base_score_range
base_score_range <- 10
base_score_test_regression <- c(-base_score_range, -base_score_range/2, 0, base_score_range/2, base_score_range) 
xg_pars_regression <- list(eta=0.01, max.depth=3)
n_trees_max_regression <- 1000

#run xgboost cross validation for each choice of base_score
NRuns_regression <- 10 #number of runs for each base score
cv_err_regression <- matrix(0, nrow=NRuns_regression, ncol=length(base_score_test_regression))
n_trees_opt_regression <- matrix(0, nrow=NRuns_regression, ncol=length(base_score_test_regression))
for (i in 1:NRuns_regression) {
  #generate data from simple polynomial function +noise
  n <- 1000
  x <- mvrnorm(n=n, mu=c(0,0,0), Sigma=diag(3))
  y_regression <- 2 * (x[,1]>=0.3) + 2 * (x[,2] < 0) + x[,3] ^ 2 + x[,1] * x[,2] + x[,2] * x[,3] + 10*rnorm(n=nrow(x))
  y_regression <- y_regression - mean(y_regression) #make mean 0
  for (j in 1:length(base_score_test_regression)) {
    print(paste0("Run ", i, " of ", NRuns_regression, ", base_score ", j, " of ", length(base_score_test_regression)))

    #set base score
    xg_pars_regression$base_score <- base_score_test_regression[j]

    #run cross validation
    xg_cv_results_regression <- xgb.cv(xg_pars_regression
                                       , data = xgb.DMatrix(data.matrix(x), label=data.matrix(y_regression), missing=NA)
                                       , nrounds=n_trees_max_regression
                                       , nfold=3
                                       , verbose=FALSE
                                       )

    #keep track of cv error and optimal number of trees
    cv_err_regression[i,j] <- min(xg_cv_results_regression$test.rmse.mean)
    n_trees_opt_regression[i,j] <- which.min(xg_cv_results_regression$test.rmse.mean)
  }
}

cv_reg_df <- as.data.frame(cv_err_regression)
colnames(cv_reg_df) <- as.character(base_score_test_regression)
write.csv(colMeans(cv_reg_df), file="cv_reg_df.csv")

ntrees_reg_df <- as.data.frame(n_trees_opt_regression)
colnames(ntrees_reg_df) <- as.character(base_score_test_regression)
write.csv(colMeans(ntrees_reg_df), file="ntrees_reg_df.csv")

##########################
# CLASSIFICATION EXAMPLE #
##########################
base_score_test_classification <- seq(from=0.01, to=0.99, length.out=5)
xg_pars_classification <- list(eta=0.01, max.depth=3, objective="binary:logistic", eval_metric="logloss")
n_trees_max_classification <- 1000

#run xgboost cross validation for each choice of base_score
NRuns_classification <- 10 #number of runs for each base score
cv_err_classification <- matrix(0, nrow=NRuns_classification, ncol=length(base_score_test_classification))
n_trees_opt_classification <- matrix(0, nrow=NRuns_classification, ncol=length(base_score_test_classification))
for (i in 1:NRuns_classification) {
  #generate data from simple polynomial function +noise
  n <- 1000
  x <- mvrnorm(n=n, mu=c(0,0,0), Sigma=diag(3))
  y_classification_logit <- 2 * (x[,1]>=0.3) + 2 * (x[,2] < 0) + x[,3] ^ 2 + x[,1] * x[,2] + x[,2] * x[,3] + 10*rnorm(n=nrow(x))
  #shift the mean to -20 on the logit scale; this makes P(y=1) decidedly different from 0.5
  y_classification_logit <- y_classification_logit - mean(y_classification_logit) - 20
  y_classification <- round(inv.logit(y_classification_logit))
  for (j in 1:length(base_score_test_classification)) {
    print(paste0("Run ", i, " of ", NRuns_classification, ", base_score ", j, " of ", length(base_score_test_classification)))
    #set base score
    xg_pars_classification$base_score <- base_score_test_classification[j]

    #run cv
    xg_cv_results_classification <- xgb.cv(xg_pars_classification
                                       , data = xgb.DMatrix(data.matrix(x), label=data.matrix(y_classification), missing=NA)
                                       , nrounds=n_trees_max_classification
                                       , nfold=5
                                       , verbose=FALSE
                                       , nthread=1)

    #keep track of cv error and optimal number of trees
    cv_err_classification[i,j] <- min(xg_cv_results_classification$test.logloss.mean)
    n_trees_opt_classification[i,j] <- which.min(xg_cv_results_classification$test.logloss.mean)
  }
}

cv_class_df <- as.data.frame(cv_err_classification)
colnames(cv_class_df) <- as.character(base_score_test_classification)
write.csv(colMeans(cv_class_df), file="cv_class_df.csv")

ntrees_class_df <- as.data.frame(n_trees_opt_classification)
colnames(ntrees_class_df) <- as.character(base_score_test_classification)
write.csv(colMeans(ntrees_class_df), file="ntrees_class_df.csv")

@gugatr0n1c

Yeah, I agree.
It makes sense: when base_score is set to 0.5 and we have a regression problem where np.average(label) is very high and eta is very small (I used eta = 0.0015 and num_tree = 5000), the first hundreds of trees just try to catch this big mean, which takes a lot of time. With base_score set to mean(label) this is solved right away at the beginning, and the next trees can work on the original task.

@pietermarsman

+1

@BlindApe

And with multiclass the issue is more painful, see #1380.
Beginning with a prior estimate of each class probability seems reasonable.

@sebconort

+1
If you think of all the computation time (and electric power) wasted by people not adjusting their base_score manually, this is an important (ecological) issue!

We can think of all the marketing scores, credit scoring, or insurance pricing models where the final prediction distributions are centered around values below 10%.

I think Kaggle could sponsor the fix, given how much money they would save with all those people running public scripts with xgboost and its default params ^^

@antoinecarme

antoinecarme commented Mar 31, 2017

Hi all,

I am new to xgboost and am interested in an automatic choice for the base_score parameter.

The choice I would make (some mentioned it above) is based on class prior probabilities (label frequencies) in the multi-class case, and on the signal mean for regression (equivalent to centering).

All these choices are consistent. A regression with a discrete 0-1 target has the frequency of 1s as its target mean and produces the same model as a classification on the same dataset, which is an interesting invariance of the algorithm.

This would make the parameter unnecessary for the end user (an internal choice).

Thanks

Antoine
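The behaviour Antoine proposes could be sketched as the following helper (hypothetical code, not part of xgboost; note that per-class priors for multiclass would need vector support for base_score, as discussed in #1380):

```r
# Hypothetical sketch of an automatic base_score choice, not current xgboost code.
auto_base_score <- function(y, objective) {
  if (objective == "binary:logistic") {
    mean(y)                           # positive-class frequency
  } else if (objective == "multi:softprob") {
    as.numeric(table(y)) / length(y)  # per-class prior probabilities (#1380)
  } else {
    mean(y)                           # regression: label mean (centering)
  }
}
```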

@tqchen tqchen closed this as completed Jul 4, 2018
@kdebrab

kdebrab commented Sep 7, 2018

It seems to me that adjusting the base_score to mean(label) improves results considerably for regression with 'reg:gamma' as objective, see https://stats.stackexchange.com/a/365803/219262 for an example.
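A sketch of that comparison (assuming training data already wrapped in a `dtrain` DMatrix with strictly positive labels, as a gamma objective requires):

```r
params <- list(objective = "reg:gamma", eta = 0.01, max_depth = 3)

# Default base_score = 0.5
cv_default <- xgb.cv(c(params, base_score = 0.5),
                     dtrain, nrounds = 1000, nfold = 5, verbose = FALSE)

# base_score set to the label mean
cv_mean <- xgb.cv(c(params, base_score = mean(getinfo(dtrain, "label"))),
                  dtrain, nrounds = 1000, nfold = 5, verbose = FALSE)
```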

@lock lock bot locked as resolved and limited conversation to collaborators Dec 6, 2018