Update default objectives for automl #613

Merged · 3 commits merged into master on Apr 10, 2020

Conversation

@dsherry (Contributor) commented Apr 9, 2020

Classification
We definitely shouldn't be using precision to rank the leaderboard for classification problems. If you just favor precision, you'll end up with trivial models. I think log loss is a much better default to use. It works with predicted probabilities, so it won't depend on the threshold value we choose. (We could decide to change this once we merge #346, which will add threshold tuning for binary classification.)
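For illustration, here's a minimal sketch of computing log loss directly from predicted probabilities (using scikit-learn's log_loss with made-up values; not part of this PR):

import numpy as np
from sklearn.metrics import log_loss

# Made-up binary labels and predicted probabilities of the positive class.
y_true = np.array([0, 1, 1, 0])
y_proba = np.array([0.1, 0.8, 0.6, 0.3])

# Log loss scores the probabilities directly, so no classification threshold is involved.
print(log_loss(y_true, y_proba))  # lower is better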

Regression
For regression, R2 is fine, but I'd rather use MAE or MSE because, in my opinion, those are more standard. MAE in particular is in the original units of the target value, which is a nice property. But MSE is better at penalizing large errors.
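As a quick illustration of that tradeoff (plain numpy with made-up values; not part of this change), MSE penalizes one large error much more heavily than MAE does:

import numpy as np

y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_small_errors = np.array([9.0, 11.0, 9.0, 11.0])      # four errors of 1
y_one_large_error = np.array([10.0, 10.0, 10.0, 6.0])  # one error of 4

def mae(y, y_hat):
    return np.abs(y - y_hat).mean()

def mse(y, y_hat):
    return ((y - y_hat) ** 2).mean()

print(mae(y_true, y_small_errors), mse(y_true, y_small_errors))        # 1.0 1.0
print(mae(y_true, y_one_large_error), mse(y_true, y_one_large_error))  # 1.0 4.0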

It's worth noting these settings don't affect how each model's optimizer works, but they do affect a) how the models are ranked and b) how the next set of pipeline parameters is chosen.

@dsherry added the enhancement label (An improvement to an existing feature.) on Apr 9, 2020
@jeremyliweishih (Collaborator)

I agree that MSE and MAE are better metrics than R2, but what I like about R2 is that it's easier to understand (bounded between 0 and 1) and maybe more interpretable than MSE and MAE because of that. Do you think this value of R2 is worth considering for a default objective? I'm not too sure it's valuable enough.

@codecov (bot) commented Apr 9, 2020

Codecov Report

Merging #613 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #613      +/-   ##
==========================================
+ Coverage   98.87%   98.90%   +0.02%     
==========================================
  Files         118      118              
  Lines        4456     4456              
==========================================
+ Hits         4406     4407       +1     
+ Misses         50       49       -1     
Impacted Files                                 Coverage Δ
evalml/automl/auto_classification_search.py   100.00% <100.00%> (ø)
evalml/automl/auto_base.py                      96.53% <0.00%>   (+0.34%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fa399cc...20f5b07.

@kmax12 (Contributor) left a comment

change looks good to me, assuming tests pass

@angela97lin (Contributor) left a comment

seems like you might need to change the tests but otherwise LGTM 👍

@dsherry (Contributor, Author) commented Apr 9, 2020

@jeremyliweishih I hear you about R2. Having an upper bound of 1 on the output is certainly a nice property. And having normalization is nice in many (but not all) cases.

If we're trying to maximize interpretability, I think MAE is the way to go, because it's in the units of the target. An MAE of 10 literally means that on average, assuming i.i.d. data, your model will be off by 10 units. That's pretty hard to beat. MSE is a close second for me because it's easy to explain: the average of the squared errors.

However I don't think interpretability is the most important thing for an automl metric. Metrics define what we value in a model; in automl they provide a way to compare two models and decide which is better using those values. I want a metric which does that as effectively as possible, because it'll favor models which match my values. That's one of the reasons I think it's pretty cool we support custom objectives.

So why MSE over R2 here? My argument really boils down to the fact that MSE is simpler, but still penalizes large errors by using an L2 norm rather than L1. I don't have much more of an argument than that, and could probably be convinced otherwise if you have a counterargument, haha. And I expect our position on this will evolve.

Also, this is somewhat of a tangent, but there is no lower bound on R2 in practice:

import numpy as np

def r2(y_actual, y_predicted):
    # Coefficient of determination; y_actual and y_predicted are numpy arrays.
    total_sum_of_squares = ((y_actual - y_actual.mean())**2).sum()
    residual_sum_of_squares = ((y_actual - y_predicted)**2).sum()
    return 1 - residual_sum_of_squares / total_sum_of_squares

Note that if the residual sum of squares is greater than the total sum of squares, the resulting value will dip below 0. If this happens in practice the right thing to do is to throw your model in a trash can and light it on fire 🗑️ 🔥 😂

I thought this was an interesting read.

@dsherry (Contributor, Author) commented Apr 9, 2020

Yep, just wanted to get the change reviewed, will circle back and fix tests later. Thanks!

@jeremyliweishih (Collaborator) commented Apr 9, 2020

Haha negative r2 has happened to me before 😅

@angela97lin (Contributor)

^lol same with me training a linear regression model today!

@dsherry (Contributor, Author) commented Apr 9, 2020

@angela97lin oh no, lol! @jeremyliweishih yeah, it's happened to me a couple times, but it was always the result of a bug or two 😂 A trivial model can have negative R2:

[Screenshot: a trivial model producing a negative R2 score]
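A rough numeric sketch of that idea, reusing the r2 function from above with made-up values:

import numpy as np

y_actual = np.array([1.0, 2.0, 3.0, 4.0])
# A trivial model that always predicts a constant far from the mean of y_actual.
y_trivial = np.full_like(y_actual, 10.0)

print(r2(y_actual, y_trivial))  # far below zero, since RSS >> TSS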

@dsherry force-pushed the ds_update_default_classification_objective branch from ab200aa to 20f5b07 on April 10, 2020 at 22:56
@dsherry (Contributor, Author) commented Apr 10, 2020

After sleeping on it, I'm switching regression back to R2. I do think using log loss for classification is an improvement, though, and will merge that once tests are green.

@dsherry merged commit 434e7e3 into master on Apr 10, 2020
@dsherry deleted the ds_update_default_classification_objective branch on April 10, 2020 at 23:28