Use range-based approach for high variance check in AutoMLSearch #2622
Conversation
Codecov Report
```
@@           Coverage Diff            @@
##            main    #2622    +/-   ##
=======================================
+ Coverage   99.9%    99.9%    +0.1%
=======================================
  Files        298      298
  Lines      27098    27161      +63
=======================================
+ Hits       27054    27117      +63
  Misses        44       44
```
Continue to review full report at Codecov.
```python
score_needs_proba = True
perfect_score = 0.0
is_bounded_like_percentage = False  # Range [0, Inf)
expected_range = [0, 1]
```
Although the range of log loss is [0, inf), we chose to set its expected range much smaller. I chose [0, 1] since we ideally expect the models to output very similar scores across the different CV folds. If the scores are too different (in this case, a difference >= 0.5), we want to warn users that there could be high variance.
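The idea described above can be sketched as follows. This is a hypothetical helper, not the actual `AutoMLSearch._check_for_high_variance` implementation; the function name, signature, and 0.5 threshold are assumptions based on the comment (with `expected_range = [0, 1]`, a score spread of 0.5 or more, i.e. half the range, triggers the warning):

```python
# Hypothetical sketch of a range-based high-variance check.
# A warning fires when the spread of CV scores is at least `threshold`
# times the width of the objective's expected range.

def check_for_high_variance(cv_scores, expected_range, threshold=0.5):
    """Return True when CV scores span too much of the expected range."""
    range_width = expected_range[1] - expected_range[0]
    spread = max(cv_scores) - min(cv_scores)
    return spread >= threshold * range_width

# Log Loss Binary with expected_range = [0, 1]:
print(check_for_high_variance([0.2, 0.3, 0.9], expected_range=[0, 1]))   # spread 0.7 -> True
print(check_for_high_variance([0.2, 0.25, 0.3], expected_range=[0, 1]))  # spread 0.1 -> False
```

The advantage of normalizing by the range width is that the same relative threshold works for objectives on very different scales.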
```python
if objectives == "Log Loss Binary":
    assert automl._check_for_high_variance(pipeline, cv_scores)
else:
    assert not automl._check_for_high_variance(pipeline, cv_scores)
```
This objective shouldn't raise any high-variance warnings.
ParthivNaresh left a comment:
Looks great!
```python
score_needs_proba = False
perfect_score = 1
is_bounded_like_percentage = False  # Range (-Inf, 1]
expected_range = [-1, 1]
```
Continuing the conversation from the design doc: I think many negative R2 scores come from outliers, which is what LogTransformer was hoping to address. That said, I don't know whether we'll see similar scores across CV folds when outliers exist, in which case I think this will always throw a high variance warning. Not a blocker, just something I think we should be aware of.
@ParthivNaresh That's a good point! When there are outliers, this will likely raise a high variance warning more often, but I think that makes sense to do as well. In scenarios where each fold gets a different enough variation of the data that the models perform fairly differently, raising a high variance warning would be beneficial.
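To illustrate the scenario discussed above with a hypothetical helper (the function and scores below are illustrative assumptions, not the actual AutoMLSearch code): with `expected_range = [-1, 1]` for R2, a single outlier-heavy fold that drives its score negative can push the spread past half the range width, firing the warning.

```python
# Hypothetical range-based check, as in the sketch above.
def check_for_high_variance(cv_scores, expected_range, threshold=0.5):
    range_width = expected_range[1] - expected_range[0]
    return max(cv_scores) - min(cv_scores) >= threshold * range_width

# Two healthy folds and one fold hit by outliers: spread 1.4 >= 1.0 -> warning.
r2_scores = [0.8, 0.75, -0.6]
print(check_for_high_variance(r2_scores, expected_range=[-1, 1]))  # True
```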
chukarsten left a comment:
This LGTM! Thanks! I think I just put up a small comment that's a nit, but no big deal. Sorry, I reviewed this yesterday but somehow forgot to hit approve.
eccabay left a comment:
LGTM! My only request would be to add a small section to the defining custom objectives section of our docs to reflect the addition of `expected_range`.
fix #2621
Design doc and discussion here