base_score default #799
Comments
Normally the choice of base_score won't affect final performance much as long as enough boosting rounds are run. But I agree that allowing it to be set to mean(label) could be helpful in some cases.
Agree that the effects should be minimal. I'll run some tests when I have some time.
For highly imbalanced data, I sometimes restrict max_delta_step to smaller values.
For highly imbalanced data, why not tune the scale_pos_weight parameter for classification problems? In what cases should scale_pos_weight be tuned, and in what cases should max_delta_step be tuned?
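On the scale_pos_weight question: the xgboost parameter documentation suggests sum(negative instances) / sum(positive instances) as a starting value for imbalanced binary classification. A minimal sketch of that rule of thumb, in Python (the helper name is mine, not part of the xgboost API):

```python
# Rule of thumb from the xgboost parameter docs:
# scale_pos_weight ~= count(negative examples) / count(positive examples).
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive examples for 0/1 labels."""
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos

labels = [1] * 10 + [0] * 990  # 1% positive rate
print(suggested_scale_pos_weight(labels))  # prints 99.0
```

Whether this helps more than tuning max_delta_step likely depends on whether you care about ranking metrics (where reweighting helps) or calibrated probabilities (where it distorts them).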
For regression tasks, setting base_score = mean(label) reduces the number of boosting rounds by about 25 percent on my dataset. I always set this manually for regression, so I'm voting for this as well.
@gugatr0n1c I am also noticing substantial differences in the number of trees required, in both classification and regression. The number of trees required is smallest when I set base_score=mean(label). The performance doesn't change much (in terms of CV error). R code is below. Does anyone object to setting the default base_score to mean(label)?
# -runs 2 tests for the xgboost R package (one for regression, one for classification)
# -tests how cv error and optimal number of trees depends on base_score for a simple problem
library(xgboost)
library(MASS)
library(boot)
######################
# REGRESSION EXAMPLE #
######################
#try a few base_scores ranging between +/- base_score_range
base_score_range <- 10
base_score_test_regression <- c(-base_score_range, -base_score_range/2, 0, base_score_range/2, base_score_range)
xg_pars_regression <- list(eta=0.01, max.depth=3)
n_trees_max_regression <- 1000
#run xgboost cross validation for each choice of base_score
NRuns_regression <- 10 #number of runs for each base score
cv_err_regression <- matrix(0, nrow=NRuns_regression, ncol=length(base_score_test_regression))
n_trees_opt_regression <- matrix(0, nrow=NRuns_regression, ncol=length(base_score_test_regression))
for (i in 1:NRuns_regression) {
#generate data from simple polynomial function +noise
n <- 1000
x <- mvrnorm(n=n, mu=c(0,0,0), Sigma=diag(3)) #3 independent standard normal features
y_regression <- 2 * (x[,1]>=0.3) + 2 * (x[,2] < 0) + x[,3] ^ 2 + x[,1] * x[,2] + x[,2] * x[,3] + 10*rnorm(n=nrow(x))
y_regression <- y_regression - mean(y_regression) #make mean 0
for (j in 1:length(base_score_test_regression)) {
print(paste0("Run ", i, " of ", NRuns_regression, ", base_score ", j, " of ", length(base_score_test_regression)))
#set base score
xg_pars_regression$base_score <- base_score_test_regression[j]
#run cross validation
xg_cv_results_regression <- xgb.cv(xg_pars_regression
, data = xgb.DMatrix(data.matrix(x), label=data.matrix(y_regression), missing=NA)
, nrounds=n_trees_max_regression
, nfold=3
, verbose=FALSE
)
#keep track of cv error and optimal number of trees
cv_err_regression[i,j] <- min(xg_cv_results_regression$test.rmse.mean)
n_trees_opt_regression[i,j] <- which.min(xg_cv_results_regression$test.rmse.mean)
}
}
cv_reg_df <- as.data.frame(cv_err_regression)
colnames(cv_reg_df) <- as.character(base_score_test_regression)
write.csv(colMeans(cv_reg_df), file="cv_reg_df.csv")
ntrees_reg_df <- as.data.frame(n_trees_opt_regression)
colnames(ntrees_reg_df) <- as.character(base_score_test_regression)
write.csv(colMeans(ntrees_reg_df), file="ntrees_reg_df.csv")
##########################
# CLASSIFICATION EXAMPLE #
##########################
base_score_test_classification <- seq(from=0.01, to=0.99, length.out=5)
xg_pars_classification <- list(eta=0.01, max.depth=3, objective="binary:logistic", eval_metric="logloss")
n_trees_max_classification <- 1000
#run xgboost cross validation for each choice of base_score
NRuns_classification <- 10 #number of runs for each base score
cv_err_classification <- matrix(0, nrow=NRuns_classification, ncol=length(base_score_test_classification))
n_trees_opt_classification <- matrix(0, nrow=NRuns_classification, ncol=length(base_score_test_classification))
for (i in 1:NRuns_classification) {
#generate data from simple polynomial function +noise
n <- 1000
x <- mvrnorm(n=n, mu=c(0,0,0), Sigma=diag(3)) #3 independent standard normal features
y_classification_logit <- 2 * (x[,1]>=0.3) + 2 * (x[,2] < 0) + x[,3] ^ 2 + x[,1] * x[,2] + x[,2] * x[,3] + 10*rnorm(n=nrow(x))
#shift the logit down by 20 so that P(y=1) is decidedly different from 0.5
y_classification_logit <- y_classification_logit - mean(y_classification_logit) - 20
y_classification <- round(inv.logit(y_classification_logit))
for (j in 1:length(base_score_test_classification)) {
print(paste0("Run ", i, " of ", NRuns_classification, ", base_score ", j, " of ", length(base_score_test_classification)))
#set base score
xg_pars_classification$base_score <- base_score_test_classification[j]
#run cv
xg_cv_results_classification <- xgb.cv(xg_pars_classification
, data = xgb.DMatrix(data.matrix(x), label=data.matrix(y_classification), missing=NA)
, nrounds=n_trees_max_classification
, nfold=5
, verbose=FALSE
, nthread=1)
#keep track of cv error and optimal number of trees
cv_err_classification[i,j] <- min(xg_cv_results_classification$test.logloss.mean)
n_trees_opt_classification[i,j] <- which.min(xg_cv_results_classification$test.logloss.mean)
}
}
cv_class_df <- as.data.frame(cv_err_classification)
colnames(cv_class_df) <- as.character(base_score_test_classification)
write.csv(colMeans(cv_class_df), file="cv_class_df.csv")
ntrees_class_df <- as.data.frame(n_trees_opt_classification)
colnames(ntrees_class_df) <- as.character(base_score_test_classification)
write.csv(colMeans(ntrees_class_df), file="ntrees_class_df.csv")
Yeah, I agree.
+1
And with multiclass the issue is even more painful: see #1380.
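One way to see why multiclass hurts more: a single scalar base_score cannot match per-class prior frequencies, whereas per-class initial margins of log(p_k) would reproduce the priors exactly under softmax. A standalone Python sketch of that idea (illustrative only; it does not reflect current xgboost behavior):

```python
import math

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

priors = [0.7, 0.2, 0.1]                  # class frequencies in the training data
margins = [math.log(p) for p in priors]   # candidate per-class initial margins
recovered = softmax(margins)              # softmax of log-priors returns the priors
print([round(p, 10) for p in recovered])
```

With one shared base_score, the booster has to spend early rounds building these per-class offsets itself.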
+1. Think of all the marketing scores, credit scoring, or insurance pricing applications where final prediction distributions are centered around values below 10%. I think Kaggle could sponsor the fix, given how much money they would save with all those people running public scripts with xgboost and its default params ^^
Hi all, I am new to xgboost and am interested in an automatic choice for the base_score parameter. The choice I would make (as some mentioned above) is the class prior probabilities (label frequencies) in the multi-class case and the label mean for regression (equivalent to centering). These choices are consistent: a regression on a discrete 0-1 target has the frequency of 1s as its label mean and produces the same model as a classification on the same dataset, which is an interesting invariance of the algorithm. With this internal choice, the end user would not need the parameter at all. Thanks, Antoine
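For binary:logistic, base_score is given on the probability scale and converted internally to a raw margin, so setting it to mean(label) starts the model at the log-odds of the class prior. A quick standalone Python check of that correspondence (not using xgboost itself):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

labels = [1] * 120 + [0] * 880
prior = sum(labels) / len(labels)  # mean(label) = 0.12
margin0 = logit(prior)             # raw score implied by base_score = prior
# The initial prediction equals the class prior exactly:
print(round(sigmoid(margin0), 12), round(prior, 12))
```

This is the binary special case of Antoine's invariance argument: the default of 0.5 corresponds to an initial margin of 0, which only matches balanced data.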
It seems to me that setting base_score to mean(label) improves results considerably for regression with the 'reg:gamma' objective; see https://stats.stackexchange.com/a/365803/219262 for an example.
At least in the R package, base_score appears to default to 0.5. For a classification problem this seems reasonable, but it strikes me as a bizarre choice for a regression problem. Would it make more sense to set base_score=mean(label) by default? That choice might even be an improvement over the current default in both regression and classification settings.
If the learning rate/shrinkage were set to 1, the choice of base_score would be irrelevant: the resulting model would simply compensate for any change in the intercept. However, essentially all reasonable implementations of boosted trees use a learning rate much smaller than 1, so in principle the choice of base_score does affect the final model.
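To put a rough number on this: with squared-error loss, if each round's tree could fit the residual mean exactly, a round would remove a fraction eta of the remaining intercept gap, so a misspecified base_score decays like (1 - eta)^rounds. A small Python sketch under that simplifying assumption (not xgboost itself):

```python
eta = 0.01     # learning rate, matching the experiments above
offset = 10.0  # gap between base_score and mean(label)
target = 0.1   # stop once the gap is down to 1% of its original size

rounds = 0
while offset > target:
    offset *= (1 - eta)  # each round removes a fraction eta of the remaining gap
    rounds += 1
print(rounds)  # hundreds of rounds spent just correcting the intercept
```

With eta=0.01 this takes roughly 460 rounds before the trees model anything beyond the intercept, which is consistent with the reduced tree counts reported in the comments above.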
I understand that it is straightforward to set base_score manually, but if there is a way to pick a better default, it may be worth implementing.
Any thoughts?