-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better default value for data_checks in AutoSearchBase #892
Better default value for data_checks in AutoSearchBase #892
Conversation
… github.com:FeatureLabs/evalml into 799-better-handling-of-default-vs-disable-vs-custom
Codecov Report
@@ Coverage Diff @@
## master #892 +/- ##
=======================================
Coverage 99.77% 99.77%
=======================================
Files 193 193
Lines 8851 8881 +30
=======================================
+ Hits 8831 8861 +30
Misses 20 20
Continue to review full report at Codecov.
|
@dsherry @angela97lin Fixed the diff, not sure what happened. Sorry about that! |
@freddyaboulton lol no problem! Thanks, will take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton functionality and tests look good! Left a comment about splitting the validation between DataChecks
and automl search. Once we resolve that conversation I'll approve.
docs/source/changelog.rst
Outdated
@@ -20,12 +20,13 @@ Changelog | |||
* Added SelectColumns transformer :pr:`873` | |||
* Added ability to evaluate additional pipelines for automl search :pr:`874` | |||
* Added `default_parameters` class property to components and pipelines :pr:`879` | |||
* Better support for disabling data checks in automl search :pr:`892` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add something to the breaking changes section in here as well, because we're deleting EmptyDataChecks
and changing the pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also rephrase this as "Added better support..." w.r.t our contribution docs!
if not isinstance(data_checks, DataChecks): | ||
raise ValueError("data_checks parameter must be a DataChecks object!") | ||
|
||
data_checks = self._validate_data_checks(data_checks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's have the DataCheck
object do the list validation part for us, but do the string parsing and stuff here, because that's specific to automl search. To steal from what you've already got:
if data_checks == 'auto':
data_checks = DefaultDataChecks()
elif data_checks == 'disabled':
data_checks = EmptyDataChecks() # could still use this internally
elif isinstance(data_checks, list):
data_checks = DataChecks(data_checks)
elif not isinstance(data_checks, DataChecks):
raise AutoMLSearchException('invalid value of type {} provided for "data_checks" parameter'.format(...))
And then we update DataChecks.__init__
to validate the list input, as you have above.
The nice thing about this is that our DataChecks
can take a list right now, so we'd be able to reuse the list validation code in the future.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me! Makes sense to move the list validaton to DataChecks since its already expecting a list.
automl.search(X, y, data_checks=[1]) | ||
with pytest.raises(ValueError, match="If data_checks is a string, it must be either 'auto' or 'disabled'. " | ||
"Received 'default'."): | ||
automl.search(X, y, data_checks="default") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
|
||
_run_test(with_data_checks=[MockDataCheckErrorAndWarning()]) | ||
_run_test(with_data_checks=DataChecks([MockDataCheckErrorAndWarning()])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO its fine to scope this public since its test code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, perhaps instead of this you could use @pytest.mark.parametrize
, since that's a pattern we've started using? I think its a bit easier to read but I acknowledge that's an opinion on style and not a functional difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll use parametrize!
evalml/automl/automl_search.py
Outdated
raise an exception. | ||
|
||
Returns: | ||
An instance of DataChecks to perform. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"An instance of DataChecks used to perform checks before search..." or something along those lines?
automl.search(X, y, data_checks="disabled") | ||
assert automl.data_check_results is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate of above?
automl.search(X, y, data_checks=1) | ||
with pytest.raises(ValueError, match="All elements of parameter data_checks must be an instance of DataCheck."): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe silly and unnecessary but could be good to add a test where one element is a data check but one element isn't? Ex: [EmptyDataChecks(), 1].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea!
@dsherry @angela97lin I have addressed your comments! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
evalml/automl/automl_search.py
Outdated
|
||
if isinstance(data_checks, DataChecks): | ||
return data_checks | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silly style nit-pick: delete the newlines to keep the file smaller? 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes 😬
evalml/automl/automl_search.py
Outdated
@@ -241,10 +275,12 @@ def search(self, X, y, data_checks=None, feature_types=None, raise_errors=True, | |||
show_iteration_plot (boolean, True): Shows an iteration vs. score plot in Jupyter notebook. | |||
Disabled by default in non-Jupyter enviroments. | |||
|
|||
data_checks (DataChecks, None): A collection of data checks to run before searching for the best classifier. If data checks produce any errors, an exception will be thrown before the search begins. If None, uses DefaultDataChecks. Defaults to None. | |||
data_checks (DataChecks, list(Datacheck), str, None): A collection of data checks to run before searching | |||
for the best classifier. If data checks produce any errors, an exception will be thrown before the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of saying "classifier" can say "A collection of data checks to run before automl search."
@@ -8,6 +11,10 @@ def __init__(self, data_checks=None): | |||
Arguments: | |||
data_checks (list (DataCheck)): list of DataCheck objects | |||
""" | |||
|
|||
if not all(isinstance(check, DataCheck) for check in data_checks): | |||
raise ValueError("All elements of parameter data_checks must be an instance of DataCheck.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
Can't we also move the list
check in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
assert automl.data_check_results is None | ||
|
||
automl.search(X, y, data_checks=None) | ||
assert automl.data_check_results is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton would this dataset normally return data check results? If not, we should find a dataset or mocking pattern to ensure that's the case, so that this test proves the checks are in fact disabled here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I don't think this dataset would return any check errors. I'll make the necessary changes.
|
||
@pytest.mark.parametrize("data_checks", | ||
[[MockDataCheckErrorAndWarning()], | ||
DataChecks([MockDataCheckErrorAndWarning()])]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@pytest.mark.parametrize("data_checks", | ||
[[MockDataCheckErrorAndWarning()], | ||
DataChecks([MockDataCheckErrorAndWarning()])]) | ||
def test_automl_data_checks_raises_error(data_checks, caplog): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that by default we should mock fit
/score
on automl tests which run search
, if its not too much effort to add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely!
Pull Request Description
Previously, it was awkward to disable data checking in AutoSearchBase.search because the default value was None and it corresponded to DefaultDataChecks. This PR addresses #799 by letting the data_checks parameter be a list, str, DataChecks, or None to give the user more control over the data checks they want to perform.
After creating the pull request: in order to pass the changelog_updated check you will need to update the "Future Release" section of
docs/source/changelog.rst
to include this pull request by adding :pr:123
.