Add HighVarianceCVDataCheck #1254

Merged
merged 13 commits into from Oct 8, 2020

jeremyliweishih
Contributor

Fixes #1117.

@@ -763,10 +766,6 @@ def describe_pipeline(self, pipeline_id, return_dict=False):
logger.info("Total training time (including CV): %.1f seconds" % pipeline_results["training_time"])
log_subtitle(logger, "Cross Validation", underline="-")

if pipeline_results["high_variance_cv"]:
Contributor Author

I moved the logging behavior from describe_pipeline to _add_result, but otherwise kept the same usage of high_variance_cv within rankings, etc. I feel it is more appropriate to notify during the search process rather than only when describing a pipeline. Happy to discuss further!

Contributor

I think this makes sense!

Collaborator

Agreed!

jeremyliweishih self-assigned this Oct 1, 2020

codecov bot commented Oct 1, 2020

Codecov Report

Merging #1254 into main will increase coverage by 8.44%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1254      +/-   ##
==========================================
+ Coverage   91.49%   99.93%   +8.44%     
==========================================
  Files         208      210       +2     
  Lines       13211    13247      +36     
==========================================
+ Hits        12088    13239    +1151     
+ Misses       1123        8    -1115     

| Impacted Files | Coverage Δ |
| --- | --- |
| evalml/automl/automl_search.py | 99.59% <100.00%> (+0.40%) ⬆️ |
| evalml/data_checks/__init__.py | 100.00% <100.00%> (ø) |
| evalml/data_checks/high_variance_cv_data_check.py | 100.00% <100.00%> (ø) |
| evalml/tests/automl_tests/test_automl.py | 100.00% <100.00%> (ø) |
| ...a_checks_tests/test_high_variance_cv_data_check.py | 100.00% <100.00%> (ø) |
| ...s/prediction_explanations_tests/test_algorithms.py | 100.00% <0.00%> (+1.11%) ⬆️ |
| evalml/tests/component_tests/test_components.py | 100.00% <0.00%> (+1.16%) ⬆️ |
| ... and 23 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a5ca749...3f80da9.

Contributor

freddyaboulton left a comment

@jeremyliweishih I think this looks great! I have some minor comments and a question about whether we should let users parametrize this check, but I think the implementation is solid.


pipeline_name = trained_pipeline.name
pipeline_summary = trained_pipeline.summary
pipeline_id = len(self._results['pipeline_results'])

high_variance_cv_check = HighVarianceCVDataCheck(threshold=0.2)
Contributor

Should users be allowed to configure this threshold now that this is a DataCheck and we let users configure other data checks?

Contributor Author

Good thought! Perhaps we can turn this on and off depending on whether data_checks = auto or data_checks = disabled. But I don't think it fits within the existing API for parameterizing data checks, since those data checks all run before search is called, whereas this check runs during search. I like the idea, but we would need to think about what API changes to make to AutoMLSearch.search().

Collaborator

👍 yep @freddyaboulton I agree this should be configurable/disable-able.

This PR is essentially porting existing behavior into a new API (data checks). I'll file an issue now to track making this configurable.
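
To make the configurability question above concrete, here is a minimal sketch of a check with a user-settable threshold. It assumes the check flags a pipeline when the coefficient of variation of its cross-validation scores, |std / mean|, exceeds the threshold; that metric is an assumption, since this thread does not spell it out. `HighVarianceCVSketch`, `SketchWarning`, and the message text are hypothetical names for illustration, not evalml's API.

```python
import statistics
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SketchWarning:
    """Hypothetical stand-in for a data check warning (not evalml's class)."""
    message: str
    check_name: str


class HighVarianceCVSketch:
    """Flags a pipeline whose CV scores vary a lot relative to their mean."""

    def __init__(self, threshold: float = 0.2):
        if threshold < 0:
            raise ValueError("threshold must be non-negative")
        self.threshold = threshold

    def validate(self, pipeline_name: str, cv_scores: Sequence[float]) -> List[SketchWarning]:
        scores = [float(s) for s in cv_scores]
        # With fewer than two folds, or a zero mean, the ratio is undefined;
        # this sketch simply reports nothing in those cases.
        if len(scores) < 2:
            return []
        mean = statistics.fmean(scores)
        if mean == 0:
            return []
        # Assumed metric: coefficient of variation, using the sample std.
        variation = abs(statistics.stdev(scores) / mean)
        if variation <= self.threshold:
            return []
        message = (
            f"High coefficient of variation ({variation:.2f} > {self.threshold}) "
            f"within cross validation scores. {pipeline_name} may not perform "
            "as estimated on unseen data."
        )
        return [SketchWarning(message=message, check_name=type(self).__name__)]


# The same scores trip the default threshold but not a more lenient one.
scores = [0.1, 0.5, 0.9]
default_results = HighVarianceCVSketch().validate("Pipeline A", scores)
lenient_results = HighVarianceCVSketch(threshold=1.0).validate("Pipeline A", scores)
```

Exposing the threshold in the constructor is the easy part. As noted above, the open question is how a user would pass it in, because this check runs during search rather than before it.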

evalml/data_checks/high_variance_cv_data_check.py (outdated, resolved)
evalml/automl/automl_search.py (outdated, resolved)
evalml/data_checks/high_variance_cv_data_check.py (outdated, resolved)
Contributor

bchen1116 left a comment

LGTM!

evalml/tests/automl_tests/test_automl.py (outdated, resolved)
jeremyliweishih
Contributor Author

@dsherry codecov was green before merging master but it's not reporting now. Could you help merge this in? Thanks!

high_variance_cv = False

if high_variance_cv_check_results:
logger.warning(high_variance_cv_check_results[0])
Collaborator

@jeremyliweishih does this show up in the console in a well-formatted way? I've noticed that str(check) doesn't look great. You may have to call .message
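
The formatting concern can be illustrated with a short sketch. Python's `logging` converts a non-string log argument with `str()`, so an object without a readable `__str__` shows up as its default representation. `SketchWarning` below is a hypothetical stand-in that is assumed to expose a `message` attribute and no custom `__str__`; it is not evalml's warning class, and `capture_log` is a helper invented for this example.

```python
import io
import logging


class SketchWarning:
    """Hypothetical warning object: has .message but no custom __str__."""

    def __init__(self, message):
        self.message = message


def capture_log(value):
    """Log `value` as a warning and return the text that was emitted."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger("high_variance_cv_sketch")
    logger.propagate = False  # keep the example output out of the root logger
    logger.setLevel(logging.WARNING)
    logger.addHandler(handler)
    try:
        logger.warning(value)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().strip()


results = [SketchWarning("High variance within cross validation scores.")]

# Logging the object itself falls back to str(obj), the default object repr.
raw_output = capture_log(results[0])

# Logging the message attribute emits the readable text.
clean_output = capture_log(results[0].message)
```

Capturing the emitted text this way is also one option for the unit test coverage of the warning output: assert on the captured string rather than only on whether a warning was logged.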

jeremyliweishih merged commit cf8df40 into main Oct 8, 2020
2 checks passed
dsherry approved these changes Oct 8, 2020
Collaborator

dsherry left a comment

@jeremyliweishih looks great! Left one comment about how the warning appears -- I think you have to use the message accessor. We should add unit test coverage for that.

dsherry mentioned this pull request Oct 29, 2020
freddyaboulton deleted the js_1117_variance branch May 13, 2022 14:51
Successfully merging this pull request may close these issues.

Replace high_variance_cv warning in AutoMLSearch to use a DataCheck
4 participants