
[Breaking] Change default evaluation metric for classification to logloss / mlogloss #6183

Merged
8 commits merged into master on Oct 2, 2020

Conversation

lorentzenchr
Contributor

Closes #6070.
Change the default evaluation metric for binary classification to "logloss" and for multi-class classification to "mlogloss".

@hcho3
Collaborator

hcho3 commented Sep 29, 2020

This pull request does not yet add a warning for performing early stopping with a default metric, so I cannot approve it as it is. Would you like to give it a try and add the warning?

@hcho3
Collaborator

hcho3 commented Sep 29, 2020

Let me know if you need help.

@lorentzenchr
Contributor Author

@hcho3 Thanks for the super fast feedback. Yes, I think I'll need some help.
First of all, how can I make CI happy?
My plan was to do that first, and only once CI is green remove the [WIP] tag and add warnings for when the default evaluation metric is not set explicitly.

@hcho3
Collaborator

hcho3 commented Sep 29, 2020

@lorentzenchr You should update this line too:

const char* DefaultEvalMetric() const override {
  return "error";
}

This should fix the error in the Google C++ test. (We use the Google Test framework, hence the name.)
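For illustration, the requested update would presumably look something like this (a sketch only; the surrounding objective class is omitted, and the multi-class counterpart would return "mlogloss" instead):

const char* DefaultEvalMetric() const override {
  return "logloss";  // previously "error"
}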

Also, many unit tests for the R package have assertions that check the default evaluation metric, so they need to be updated as well. For example:

expect_output(
  bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
                 eta = 1, nthread = 2, nrounds = nrounds, objective = "binary:logistic")
  , "train-error")

There are many such occurrences in other R unit tests.

@hcho3
Collaborator

hcho3 commented Sep 30, 2020

@lorentzenchr

To make it easier, let's just throw a warning whenever no eval_metric is explicitly set, regardless of whether early stopping is enabled. This way, we only have to change a single place in the codebase, as follows:

xgboost/src/learner.cc

Lines 1033 to 1036 in dda9e1e

if (metrics_.size() == 0 && tparam_.disable_default_eval_metric <= 0) {
  metrics_.emplace_back(Metric::Create(obj_->DefaultEvalMetric(), &generic_parameters_));
  metrics_.back()->Configure({cfg_.begin(), cfg_.end()});
}

If you think this is a good idea, I'll add a commit to this pull request.

EDIT. See my latest commit.
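For concreteness, a minimal sketch of what such a single-place warning could look like, inserted just before the default metric is appended (the message text here is illustrative, not the wording of the actual commit):

if (metrics_.size() == 0 && tparam_.disable_default_eval_metric <= 0) {
  // Warn once: after this branch runs, metrics_ is no longer empty, so the
  // warning cannot be emitted again during the same training run.
  LOG(WARNING) << "No eval_metric was explicitly set; falling back to the objective's "
                  "default metric. Note that the default for classification objectives "
                  "changes to 'logloss' / 'mlogloss' in this release. Set eval_metric "
                  "explicitly to silence this warning.";
  metrics_.emplace_back(Metric::Create(obj_->DefaultEvalMetric(), &generic_parameters_));
  metrics_.back()->Configure({cfg_.begin(), cfg_.end()});
}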

@hcho3 hcho3 changed the title [WIP] Change DefaultEvalMetric of binary classification from error to logloss [WIP] Change default evaluation metric for classification to logloss / mlogloss Sep 30, 2020
@hcho3 hcho3 changed the title [WIP] Change default evaluation metric for classification to logloss / mlogloss Change default evaluation metric for classification to logloss / mlogloss Sep 30, 2020
@hcho3 hcho3 changed the title Change default evaluation metric for classification to logloss / mlogloss [Breaking] Change default evaluation metric for classification to logloss / mlogloss Sep 30, 2020
@codecov-commenter

codecov-commenter commented Sep 30, 2020

Codecov Report

Merging #6183 into master will decrease coverage by 0.09%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6183      +/-   ##
==========================================
- Coverage   79.02%   78.93%   -0.10%     
==========================================
  Files          12       12              
  Lines        3104     3104              
==========================================
- Hits         2453     2450       -3     
- Misses        651      654       +3     
Impacted Files                       Coverage Δ
python-package/xgboost/tracker.py    93.97% <0.00%> (-1.21%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 444131a...161885d. Read the comment docs.

@lorentzenchr
Contributor Author

@hcho3 Now, it is starting to look good. Thanks a lot for your help!

@@ -1031,6 +1031,18 @@ class LearnerImpl : public LearnerIO {
std::ostringstream os;
os << '[' << iter << ']' << std::setiosflags(std::ios::fixed);
if (metrics_.size() == 0 && tparam_.disable_default_eval_metric <= 0) {
  auto warn_default_eval_metric = [](const std::string& objective, const std::string& before,
Member

Can we place this warning inside the objective function's DefaultEvalMetric()? Also, are we sure this warning is only emitted once during training?

Collaborator
@hcho3 hcho3 Sep 30, 2020

I don't think moving the warning to ObjFunction::DefaultEvalMetric() is a good idea, since then the warning code would appear in multiple places. I'd like to keep it here so that we can easily remove it later.

Also, are we sure this warning is only emitted once during training?

Yes. The warning is emitted just before the default evaluation metric gets added to the vector metrics_. Once the default metric is in metrics_, the warning will not be thrown.

@hcho3
Collaborator

hcho3 commented Sep 30, 2020

@mayer79 Can you review?

@@ -236,7 +238,7 @@ test_that("early stopping xgb.train works", {
test_that("early stopping using a specific metric works", {
set.seed(11)
expect_output(
-  bst <- xgb.train(param, dtrain, nrounds = 20, watchlist, eta = 0.6,
+  bst <- xgb.train(param[-2], dtrain, nrounds = 20, watchlist, eta = 0.6,
Collaborator

What is this for? @lorentzenchr

Contributor Author

param[2] sets eval_metric. I exclude it here because it is already specified as a keyword argument; otherwise a warning is thrown and the test fails.

Should I add a comment, or alternatively remove the keyword argument?

Collaborator

Got it. Let's keep it there.

@mayer79
Contributor

mayer79 commented Oct 1, 2020

@hcho3 and @lorentzenchr: I had a look at the changes; they LGTM. I was wondering about the related objective "binary:logitraw", which uses "auc" as its default. For consistency, this should be "logloss" as well. What do you think?

@hcho3
Collaborator

hcho3 commented Oct 1, 2020

I prefer to keep the default for binary:logitraw. Is AUC not a proper scoring metric?

@lorentzenchr
Contributor Author

lorentzenchr commented Oct 1, 2020

I prefer to keep the default for binary:logitraw. Is AUC not a proper scoring metric?

Unfortunately, not at all. There is no property/functional of the target distribution that is consistently estimated by maximizing AUC (or nobody has found one yet, or I'm not aware of it 😏).
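For context: a scoring rule $S(p, y)$ for a binary outcome is strictly proper if its expected value $\mathbb{E}_{y \sim q}[S(p, y)]$ is uniquely optimized at $p = q$, the true event probability. Log loss has this property: minimizing $-(q \log p + (1 - q) \log(1 - p))$ over $p$ yields $p = q$, so it consistently estimates the conditional event probability. AUC depends only on how the scores rank the observations and is unchanged by any monotone transform of them, which is why it does not identify such a functional.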

@mayer79
Contributor

mayer79 commented Oct 1, 2020

I prefer to keep the default for binary:logitraw. Is AUC not a proper scoring metric?

If I am not mistaken, binary:logitraw trains the same model as binary:logistic, just without back-transforming the predictions to the probability scale?

@hcho3
Collaborator

hcho3 commented Oct 1, 2020

Yes, binary:logitraw produces models with logit output. Is logloss a proper scoring rule when the output is a logit?
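As an illustration of the relationship (a self-contained sketch, not XGBoost code; the numbers are made up): binary:logitraw emits a raw score z, binary:logistic emits sigmoid(z), and logloss is defined on the probability scale, so a raw score would need the sigmoid transform before a logloss-style metric applies.

#include <cmath>
#include <cstdio>

// sigmoid: maps a raw logit score to a probability in (0, 1)
double Sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// binary log loss for a single observation, y in {0, 1}, p in (0, 1)
double LogLoss(double y, double p) {
  return -(y * std::log(p) + (1.0 - y) * std::log(1.0 - p));
}

int main() {
  double raw  = 1.386;          // what binary:logitraw would predict (raw score)
  double prob = Sigmoid(raw);   // what binary:logistic would predict (~0.8)
  std::printf("p = %.3f, logloss(y = 1) = %.3f\n", prob, LogLoss(1.0, prob));
  return 0;
}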

@hcho3
Collaborator

hcho3 commented Oct 1, 2020

FYI, I found this link that calls AUC a "semi-proper" scoring rule: https://stats.stackexchange.com/questions/339919/what-does-it-mean-that-auc-is-a-semi-proper-scoring-rule

@lorentzenchr
Contributor Author

@hcho3 Although I find it a bit inconsistent to have different defaults for eval_metric between binary:logitraw and binary:logistic (and reg:logistic), most use cases are covered by this PR changing the default only for binary:logistic. Shall we move forward?

@hcho3
Collaborator

hcho3 commented Oct 2, 2020

@lorentzenchr Sure, we can always come back to binary:logitraw later.
