Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better default value for data_checks in AutoSearchBase #892

Merged

Conversation

freddyaboulton
Copy link
Contributor

Pull Request Description

Previously, it was awkward to disable data checking in AutoSearchBase.search because the default value was None and it corresponded to DefaultDataChecks. This PR addresses #799 by letting the data_checks parameter be a list, str, DataChecks, or None to give the user more control over the data checks they want to perform.


After creating the pull request: in order to pass the changelog_updated check you will need to update the "Future Release" section of docs/source/changelog.rst to include this pull request by adding :pr:123.

@freddyaboulton freddyaboulton changed the title 799 better handling of default vs disable vs custom Better default value for data_checks in AutoSearchBase Jun 26, 2020
@codecov
Copy link

codecov bot commented Jun 26, 2020

Codecov Report

Merging #892 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #892   +/-   ##
=======================================
  Coverage   99.77%   99.77%           
=======================================
  Files         193      193           
  Lines        8851     8881   +30     
=======================================
+ Hits         8831     8861   +30     
  Misses         20       20           
Impacted Files Coverage Δ
evalml/automl/automl_search.py 98.45% <100.00%> (+0.04%) ⬆️
evalml/data_checks/data_checks.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d83c70f...7dcb9ce. Read the comment docs.

@freddyaboulton freddyaboulton self-assigned this Jun 26, 2020
@freddyaboulton freddyaboulton marked this pull request as ready for review June 26, 2020 19:35
@freddyaboulton
Copy link
Contributor Author

freddyaboulton commented Jun 26, 2020

@dsherry @angela97lin Fixed the diff, not sure what happened. Sorry about that!

@dsherry
Copy link
Contributor

dsherry commented Jun 26, 2020

@freddyaboulton lol no problem! Thanks, will take a look.

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton functionality and tests look good! Left a comment about splitting the validation between DataChecks and automl search. Once we resolve that conversation I'll approve.

@@ -20,12 +20,13 @@ Changelog
* Added SelectColumns transformer :pr:`873`
* Added ability to evaluate additional pipelines for automl search :pr:`874`
* Added `default_parameters` class property to components and pipelines :pr:`879`
* Better support for disabling data checks in automl search :pr:`892`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add something to the breaking changes section in here as well, because we're deleting EmptyDataChecks and changing the pattern

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also rephrase this as "Added better support..." w.r.t our contribution docs!

if not isinstance(data_checks, DataChecks):
raise ValueError("data_checks parameter must be a DataChecks object!")

data_checks = self._validate_data_checks(data_checks)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have the DataCheck object do the list validation part for us, but do the string parsing and stuff here, because that's specific to automl search. To steal from what you've already got:

if data_checks == 'auto':
    data_checks = DefaultDataChecks()
elif data_checks == 'disabled':
    data_checks = EmptyDataChecks() # could still use this internally
elif isinstance(data_checks, list):
    data_checks = DataChecks(data_checks)
elif not isinstance(data_checks, DataChecks):
    raise AutoMLSearchException('invalid value of type {} provided for "data_checks" parameter'.format(...))

And then we update DataChecks.__init__ to validate the list input, as you have above.

The nice thing about this is that our DataChecks can take a list right now, so we'd be able to reuse the list validation code in the future.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me! Makes sense to move the list validaton to DataChecks since its already expecting a list.

automl.search(X, y, data_checks=[1])
with pytest.raises(ValueError, match="If data_checks is a string, it must be either 'auto' or 'disabled'. "
"Received 'default'."):
automl.search(X, y, data_checks="default")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice


_run_test(with_data_checks=[MockDataCheckErrorAndWarning()])
_run_test(with_data_checks=DataChecks([MockDataCheckErrorAndWarning()]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO its fine to scope this public since its test code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, perhaps instead of this you could use @pytest.mark.parametrize, since that's a pattern we've started using? I think its a bit easier to read but I acknowledge that's an opinion on style and not a functional difference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll use parametrize!

raise an exception.

Returns:
An instance of DataChecks to perform.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"An instance of DataChecks used to perform checks before search..." or something along those lines?

Comment on lines 234 to 235
automl.search(X, y, data_checks="disabled")
assert automl.data_check_results is None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate of above?

automl.search(X, y, data_checks=1)
with pytest.raises(ValueError, match="All elements of parameter data_checks must be an instance of DataCheck."):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe silly and unnecessary but could be good to add a test where one element is a data check but one element isn't? Ex: [EmptyDataChecks(), 1].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

@freddyaboulton
Copy link
Contributor Author

@dsherry @angela97lin I have addressed your comments!

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!


if isinstance(data_checks, DataChecks):
return data_checks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silly style nit-pick: delete the newlines to keep the file smaller? 😂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 😬

@@ -241,10 +275,12 @@ def search(self, X, y, data_checks=None, feature_types=None, raise_errors=True,
show_iteration_plot (boolean, True): Shows an iteration vs. score plot in Jupyter notebook.
Disabled by default in non-Jupyter enviroments.

data_checks (DataChecks, None): A collection of data checks to run before searching for the best classifier. If data checks produce any errors, an exception will be thrown before the search begins. If None, uses DefaultDataChecks. Defaults to None.
data_checks (DataChecks, list(Datacheck), str, None): A collection of data checks to run before searching
for the best classifier. If data checks produce any errors, an exception will be thrown before the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of saying "classifier" can say "A collection of data checks to run before automl search."

@@ -8,6 +11,10 @@ def __init__(self, data_checks=None):
Arguments:
data_checks (list (DataCheck)): list of DataCheck objects
"""

if not all(isinstance(check, DataCheck) for check in data_checks):
raise ValueError("All elements of parameter data_checks must be an instance of DataCheck.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Can't we also move the list check in here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

assert automl.data_check_results is None

automl.search(X, y, data_checks=None)
assert automl.data_check_results is None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton would this dataset normally return data check results? If not, we should find a dataset or mocking pattern to ensure that's the case, so that this test proves the checks are in fact disabled here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I don't think this dataset would return any check errors. I'll make the necessary changes.


@pytest.mark.parametrize("data_checks",
[[MockDataCheckErrorAndWarning()],
DataChecks([MockDataCheckErrorAndWarning()])])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@pytest.mark.parametrize("data_checks",
[[MockDataCheckErrorAndWarning()],
DataChecks([MockDataCheckErrorAndWarning()])])
def test_automl_data_checks_raises_error(data_checks, caplog):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that by default we should mock fit/score on automl tests which run search, if its not too much effort to add

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely!

@freddyaboulton freddyaboulton merged commit 96aecd8 into master Jun 29, 2020
@freddyaboulton freddyaboulton deleted the 799-better-handling-of-default-vs-disable-vs-custom branch June 29, 2020 16:45
@freddyaboulton freddyaboulton added this to the June 2020 milestone Jun 29, 2020
@angela97lin angela97lin mentioned this pull request Jun 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automl data_checks parameter: how best to support disable vs default vs custom
3 participants