Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update AutoBase with data_checks parameter #765

Merged
merged 38 commits into from May 26, 2020
Merged

Update AutoBase with data_checks parameter #765

merged 38 commits into from May 26, 2020

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented May 11, 2020

@angela97lin angela97lin self-assigned this May 11, 2020
@codecov
Copy link

codecov bot commented May 11, 2020

Codecov Report

Merging #765 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #765   +/-   ##
=======================================
  Coverage   99.51%   99.51%           
=======================================
  Files         150      150           
  Lines        5781     5781           
=======================================
  Hits         5753     5753           
  Misses         28       28           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56c427b...56c427b. Read the comment docs.

@angela97lin angela97lin requested a review from dsherry May 21, 2020
@angela97lin angela97lin marked this pull request as ready for review May 21, 2020
@kmax12
Copy link
Contributor

kmax12 commented May 22, 2020

this may be known, but wanted to chime in case it slipped through. I was looking at the docs and it appears the text doesn't match the code output.

it says If the data checks returns any error or warning messages, automl.search() will raise a ValueError and quit before searching, but the errors seem to actually be relatefd to the pipeline not, the data check.

Beyond that, do we really want the the search to quit if the data check simply raises a warning?

@angela97lin
Copy link
Contributor Author

angela97lin commented May 22, 2020

@kmax12 Yup, thanks for chiming in! I think when @dsherry reviewed, we came to the conclusion that it'd be better--now that we have some basic logger functionality in--to just warn for warnings and raise an error if there are any errors. This differs from our previous design, which was to raise a ValueError if any warnings or errors occurred, so I still need to go back and update the examples to use data that won't error out during the search!

@dsherry
Copy link
Collaborator

dsherry commented May 22, 2020

Yeah @kmax12 when I saw the code which said no tests will run if there's any warnings, I got cold feet, lol. I think part of the value of distinguishing between errors and warnings is that an error feels serious enough to interrupt the search, but a warning should just be surfaced to the user. As @angela97lin said we can always change our minds on this later without any breaking changes when we have user feedback.

@angela97lin angela97lin requested review from kmax12 and dsherry May 22, 2020
@angela97lin
Copy link
Contributor Author

angela97lin commented May 22, 2020

@dsherry @kmax12

Thanks for the comments--I've addressed all of them, but would love if you guys could take one last pass before I merge it in.

Here's the new doc I added in this PR: https://evalml.featurelabs.com/en/709_autobase/guardrails/data_checks.html

kmax12
kmax12 approved these changes May 22, 2020
Copy link
Contributor

@kmax12 kmax12 left a comment

everything looks good, but 2 suggestions

"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd prefer not to run any data checks before running search, you can provide an `EmptyDataChecks` instance to `search()` instead."
Copy link
Contributor

@kmax12 kmax12 May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if you could disable data checks with data_checks=False? Seem a bit weird you have to import a class to turn something off. and then just get rid of EmptyDataChecks from our codebase

Copy link
Contributor Author

@angela97lin angela97lin May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmax12 Yeah, I agree with this sentiment that it's weird to need to import to disable haha @dsherry and I discussed this previously and thought that it'd be best to introduce something like EmptyDataChecks because its the most consistent (always either None / DataChecks object), rather than introducing bool/DataCheck object?

Copy link
Contributor

@kmax12 kmax12 May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other option could to to default to data_checks=True and then pass in data_checks=None to disable. I'm inclined to go this way because it's less ambiguous. It's not intuitive that data_checks=None would give you the default datachecks

Copy link
Contributor

@kmax12 kmax12 May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it's okay to have mixed types with the bool for clarity

Copy link
Contributor Author

@angela97lin angela97lin May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm so are you suggesting:

  • data_checks = True --> will run DefaultDataChecks
  • data_checks = SomeDataChecksClass --> will run SomeDataChecksClass
  • data_checks = None --> will not run anything at all?

Copy link
Contributor

@kmax12 kmax12 May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, that is my suggestion. one amendment would be False would also not run any thing.

before we finalize, let's see what @dsherry thinks since I know you two already spoke

Copy link
Collaborator

@dsherry dsherry May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree there's probably a way to make this API easier to use. I'm not sure how I feel about mixing bool/None/class in one arg. If we were going to do that, we could also use string params instead of a bool, i.e. 'disabled' or 'default' or something.

The current interface satisfies the requirements, so let's merge this PR and discuss separately, perhaps in the epic #49

Copy link
Collaborator

@dsherry dsherry May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed as #799 let's continue talking there

docs/source/guardrails/data_checks.ipynb Show resolved Hide resolved
docs/source/changelog.rst Outdated Show resolved Hide resolved
@@ -136,7 +141,7 @@ def _get_funct_name(function):

return search_desc + rankings_desc

def search(self, X, y, feature_types=None, raise_errors=True, show_iteration_plot=True):
def search(self, X, y, data_checks=None, feature_types=None, raise_errors=True, show_iteration_plot=True):
Copy link
Collaborator

@dsherry dsherry May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning to this after our discussion with max on defaults, I see what he means about this being potentially misleading. Having None as the default implies to me that nothing will happen, haha. We have this problem elsewhere too (with the tuner class after #793 ). Perhaps the way around this is to have the default be 'default'. Although at that point, it could simply be DefaultDataChecks too.

I'll file it separately and let's get this PR merged :)

Copy link
Collaborator

@dsherry dsherry left a comment

This looks ready to go to me! I answered a couple comments and left two testing suggestions. Also filed #799 for separate discussion. I say 🚢 ! 🥳

@angela97lin angela97lin merged commit ef5d308 into master May 26, 2020
2 checks passed
@dsherry dsherry deleted the 709_autobase branch Oct 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data Checks API: Update AutoBase with data_checks parameter
3 participants