Update default objectives for automl #613
Conversation
I agree that MSE and MAE are better metrics than R2, but what I like about R2 is that it's easier to understand (bounded between 0 and 1) and maybe more interpretable because of that than MSE and MAE. Do you think this value of R2 is worth considering for a default objective? I'm not too sure if it's valuable enough.
Codecov Report
@@ Coverage Diff @@
## master #613 +/- ##
==========================================
+ Coverage 98.87% 98.90% +0.02%
==========================================
Files 118 118
Lines 4456 4456
==========================================
+ Hits 4406 4407 +1
+ Misses 50 49 -1
Continue to review full report at Codecov.
change looks good to me, assuming tests pass
seems like you might need to change the tests but otherwise LGTM 👍
@jeremyliweishih I hear you about R2. Certainly having the output bounded above by 1 is a nice property. And having normalization is nice in many (but not all) cases.

If we're trying to maximize interpretability, I think MAE is the way to go, because it's in the units of the target. An MAE of 10 literally means that on average, assuming i.i.d. data, your model will be off by 10 units. That's pretty hard to beat. MSE is a close second for me because it's easy to explain: the mean of the squared errors.

However, I don't think interpretability is the most important thing for an automl metric. Metrics define what we value in a model; in automl they provide a way to compare two models and decide which is better using those values. I want a metric which does that as effectively as possible, because it'll favor models which match my values. That's one of the reasons I think it's pretty cool we support custom objectives.

So why MSE over R2 here? My argument really boils down to the fact that MSE is simpler, but still penalizes large errors using an L2 norm rather than L1. I don't have much more of an argument than that, and could probably be convinced otherwise if you have a counterargument, haha. And I expect our position on this will evolve.

Also, this is somewhat of a tangent, but there is no lower bound on R2 in practice:
Note that if the residual sum of squares is greater than the total sum of squares, the resulting value will dip below 0. If this happens in practice, the right thing to do is to throw your model in a trash can and light it on fire 🗑️ 🔥 😂 I thought this was an interesting read.
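For reference, the standard textbook definition (not something specific to this thread) makes it clear when that can happen:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}
```

Whenever SS_res > SS_tot, i.e. the model fits worse than always predicting the mean of y, R2 goes negative, and there's no floor on how far it can drop.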
Yep, just wanted to get the change reviewed, will circle back and fix tests later. Thanks!
Haha, negative R2 has happened to me before 😅
^lol same with me training a linear regression model today! |
@angela97lin oh no, lol! @jeremyliweishih yeah, it's happened to me a couple of times, but it was always the result of a bug or two 😂 A trivial model can have negative R2:
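To make that concrete, here's a minimal sketch with scikit-learn's r2_score and made-up numbers (not code from this PR), showing a constant predictor going well below zero:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# A "trivial" model that always predicts a constant far from the mean.
y_pred = np.full_like(y_true, 10.0)

# The residual sum of squares dwarfs the total sum of squares here,
# so R2 drops far below zero (about -24.5).
print(r2_score(y_true, y_pred))
```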
Force-pushed from ab200aa to 20f5b07 (Compare)
After sleeping on it, switching regression back to R2. I do think using log loss for classification is an improvement though, and will merge that once tests are green.
Classification
We definitely shouldn't be using precision to rank the leaderboard for classification problems. If you optimize for precision alone, you'll end up with trivial models. I think log loss is a much better default to use: it works with predicted probabilities, so it won't depend on the threshold value we choose. (We could decide to change this once we merge #346, which will add threshold tuning for binary classification.)
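As a rough illustration of the "trivial models" point (a sketch with scikit-learn metrics and made-up data, not code from this PR): a classifier that only flags the single example it's most confident about gets perfect precision, while log loss still punishes it for the positives it ignores.

```python
import numpy as np
from sklearn.metrics import precision_score, log_loss

# Ten examples, two of which are actually positive.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A "trivial" model: near-zero probability everywhere except the one
# positive it is very sure about.
y_proba = np.array([0.01] * 8 + [0.99, 0.02])
y_pred = (y_proba >= 0.5).astype(int)

print(precision_score(y_true, y_pred))  # 1.0 -- looks perfect
print(log_loss(y_true, y_proba))        # ~0.40 -- penalized for the missed positive
```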
Regression
For regression, R2 is fine, but I'd rather use MAE or MSE because, in my opinion, those are more standard. MAE in particular is in the original units of the target, which is a nice property, but MSE is better at penalizing large errors.
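As a quick illustration of that trade-off (again a sketch with made-up numbers, not code from this PR): spreading 10 units of absolute error evenly versus concentrating it in one prediction leaves MAE unchanged but multiplies MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.zeros(5)

# Same total absolute error (10 units), distributed differently.
evenly_off = np.full(5, 2.0)                         # every prediction off by 2
one_big_miss = np.array([0.0, 0.0, 0.0, 0.0, 10.0])  # one prediction off by 10

for y_pred in (evenly_off, one_big_miss):
    # MAE is 2.0 in both cases; MSE jumps from 4.0 to 20.0 for the big miss.
    print(mean_absolute_error(y_true, y_pred), mean_squared_error(y_true, y_pred))
```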
It's worth noting these settings don't affect how each model's internal optimizer works, but they do affect a) how the models are ranked on the leaderboard and b) how the next set of pipeline parameters is chosen.