-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HighVarianceCVDataCheck #1254
Conversation
@@ -763,10 +766,6 @@ def describe_pipeline(self, pipeline_id, return_dict=False): | |||
logger.info("Total training time (including CV): %.1f seconds" % pipeline_results["training_time"]) | |||
log_subtitle(logger, "Cross Validation", underline="-") | |||
|
|||
if pipeline_results["high_variance_cv"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the logging behavior from describe_pipeline
to _add_result
but otherwise kept the same usage of high_variance_cv
within rankings etc.. I feel like it is more appropriate to notify during the search process and not just when describing a pipeline. Happy to discuss further!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed!
Codecov Report
@@ Coverage Diff @@
## main #1254 +/- ##
==========================================
+ Coverage 91.49% 99.93% +8.44%
==========================================
Files 208 210 +2
Lines 13211 13247 +36
==========================================
+ Hits 12088 13239 +1151
+ Misses 1123 8 -1115
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeremyliweishih I think this looks great! I have some minor comments and a question about whether we should let users parametrize this check but I think the implementation is solid.
|
||
pipeline_name = trained_pipeline.name | ||
pipeline_summary = trained_pipeline.summary | ||
pipeline_id = len(self._results['pipeline_results']) | ||
|
||
high_variance_cv_check = HighVarianceCVDataCheck(threshold=0.2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should users be allowed to configure this threshold now that this is a DataCheck and we let users configure other data checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good thought! Perhaps we can turn this on and off depending on if data_checks = auto
vs. data_checks = disabled
. But I don't think it'll fit within the existing API for parameterizing data checks as all those data checks run before search is called whereas this check is called during search. I like the idea but we would need to think about what API changes to make to AutoMLSearch.search()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 yep @freddyaboulton I agree this should be configurable/disable-able.
This PR is essentially porting existing behavior into a new API (data checks). I'll file an issue now to track making this configurable.
@@ -763,10 +766,6 @@ def describe_pipeline(self, pipeline_id, return_dict=False): | |||
logger.info("Total training time (including CV): %.1f seconds" % pipeline_results["training_time"]) | |||
log_subtitle(logger, "Cross Validation", underline="-") | |||
|
|||
if pipeline_results["high_variance_cv"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@dsherry codecov was green before merging master but it's not reporting now. Could you help merge this in? Thanks! |
high_variance_cv = False | ||
|
||
if high_variance_cv_check_results: | ||
logger.warning(high_variance_cv_check_results[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeremyliweishih does this show up in the console in a well-formatted way? I've noticed that str(check)
doesn't look great. You may have to call .message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeremyliweishih looks great! Left one comment about how the warning appears -- I think you have to use the message
accessor. We should add unit test coverage for that.
Fixes #1117.