Parametrizing DataChecks #1167
Conversation
Codecov Report

```
@@           Coverage Diff            @@
##              main    #1167   +/-  ##
========================================
  Coverage    99.92%   99.92%
========================================
  Files          196      196
  Lines        12029    12121    +92
========================================
+ Hits         12020    12112    +92
  Misses           9        9
```

Continue to review the full report at Codecov.
evalml/automl/automl_search.py (outdated)

```diff
@@ -346,10 +345,11 @@ def _handle_keyboard_interrupt(self, pipeline, current_batch_pipelines):
         else:
             leading_char = ""

-    def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_plot=True):
+    def search(self, X, y, data_checks="auto", data_check_params=None, show_iteration_plot=True, feature_types=None):
```
I'm skeptical this is the best solution. Isn't the point of passing your own argument to `data_checks` that you can set the params? This solution feels weird because it requires the person to know which data checks are being run. If you know this because you set `data_checks` yourself, you should also be able to set the params there.
You're right that if you're specifying all the params via a dict, you can just create the instances yourself. I removed the `data_check_params` argument from the search API.
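For context, under that final design a user who wants non-default params constructs the check instances directly and hands them to search. A rough sketch (the import paths and `AutoMLSearch` setup are assumptions based on the evalml API of this era):

```python
import pandas as pd
from evalml.automl import AutoMLSearch
from evalml.data_checks import ClassImbalanceDataCheck

# Toy imbalanced binary dataset
X = pd.DataFrame({"a": range(100)})
y = pd.Series([0] * 95 + [1] * 5)

# Configure the check yourself, then pass the instance to search
automl = AutoMLSearch(problem_type="binary")
automl.search(X, y, data_checks=[ClassImbalanceDataCheck(threshold=0.10)])
```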
Force-pushed from 9fba156 to 7fdf544.
```diff
@@ -7,37 +7,42 @@
 class ClassImbalanceDataCheck(DataCheck):
     """Classification problems, checks if any target labels are imbalanced beyond a threshold"""

-    def validate(self, X, y, threshold=0.10):
+    def __init__(self, threshold=0.1):
```
Had to move the `threshold` param to the init to match the pattern we use for the other data checks - this will make the inclusion of this data check within automl easier in the future.
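The pattern being matched, roughly (a sketch based on the diff above; the `validate` body is elided):

```python
class ClassImbalanceDataCheck(DataCheck):
    """Classification problems, checks if any target labels are imbalanced beyond a threshold"""

    def __init__(self, threshold=0.1):
        # Parameters now live on the instance, like the other data checks
        self.threshold = threshold

    def validate(self, X, y):
        # validate reads self.threshold instead of taking it as a per-call argument
        ...
```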
Force-pushed from 1c6fa73 to eb5144c.
@freddyaboulton this rocks!

Before merging, let's resolve the conversations I left in `DataChecks` about the params name and the validation/init code in `__init__`. I also left a question in `InvalidTargetDataCheck` which we should close off. And a couple of missing unit test cases.

I also think we should add something to the user guide. I see we don't currently discuss `DataChecks` at all! We don't have to do it in this PR; we can file separately.
```diff
 class DataChecks:
     """A collection of data checks."""

-    def __init__(self, data_checks=None):
+    def __init__(self, data_checks=None, data_check_params=None):
```
Could we call this `parameters`?
evalml/data_checks/data_checks.py (outdated)

```python
        else:
            raise ValueError("All elements of parameter data_checks must be an instance of DataCheck "
                             "or a DataCheck class with any desired parameters specified in the "
                             "data_check_params dictionary.")
```
@freddyaboulton got it, so this is the backwards compatibility mechanism. It's great that you thought of this.

We're still in a position where it's OK for us to make backwards-incompatible changes to the data checks API. I'd be in favor of deleting the backwards compatibility code and simplifying so that `DataChecks` requires a list of classes, not instances. What do you think of that?
@dsherry I am happy to change `DataChecks` so that it requires a list of classes and not instances! The one caveat is that I believe we still want to let users pass a list of `DataCheck` instances to `AutoMLSearch.search` (to let them modify the behavior of the default checks or pass their own data checks). That means we'll still need a mechanism to go from `List[DataCheck]` to a `DataChecks` instance, but that can happen in another class (maybe called `AutoMLDataChecks`, a subclass of `DataChecks` with a different init).

Let me know what you think of this plan!
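A minimal sketch of that idea (the name `AutoMLDataChecks` and its init are hypothetical at this point in the discussion):

```python
class AutoMLDataChecks(DataChecks):
    def __init__(self, data_check_instances):
        # Bypass the class-based init: accept ready-made DataCheck instances
        # (e.g. from AutoMLSearch.search) instead of classes plus a params dict.
        if not all(isinstance(check, DataCheck) for check in data_check_instances):
            raise ValueError("All elements must be DataCheck instances.")
        self.data_checks = data_check_instances
```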
Sounds good! This feels similar to what we've done with components. We've added `make_pipeline_from_components`, which can take in a list of components and return a pipeline instance. Feels like we want the same thing here.
evalml/data_checks/data_checks.py (outdated)

```python
                data_check_instances.append(data_check_class(**class_params))
            except TypeError as e:
                raise DataCheckInitError(f"Encountered the following error while initializing {data_check_class.name}: {e}")
        return data_check_instances
```
I like the validation code you have here, but let's define it in a separate method and call it at the start of `__init__`, yeah?

After that, we're left with the following code:

```python
data_check_instances = []
for data_check_class in data_check_classes:
    data_check_instances.append(data_check_class(**parameters.get(data_check_class.name, {})))
return data_check_instances
```

(Could do it in one line, but maybe the expanded form makes debugging easier.)

Perhaps we could just paste that into `__init__` since it's not much?
I still think we'll need a `try/except` inside the for loop to raise a `DataCheckInitError` if the user specifies erroneous arguments for a given `DataCheck` class.

I can split the init into two methods:

```python
self._validate_data_check_classes(data_checks, params)
self.data_checks = self._init_data_checks(data_checks, params)
```

What do you think of that? Let me know if I misunderstood your suggestion.
@freddyaboulton but can't each `DataCheck`'s init be responsible for throwing if the params aren't right? I.e., we delegate to the data checks. Either way, it's just a design question; fine to merge with what you've got.
evalml/data_checks/data_checks.py (outdated)

```diff
@@ -35,3 +45,25 @@ def validate(self, X, y=None):
             messages_new = data_check.validate(X, y)
             messages.extend(messages_new)
         return messages
+
+
+def init_data_checks_from_params(data_check_classes, params):
```
`@staticmethod`, with underscore prefix?
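I.e., something like this (a sketch of the suggested refactor, body elided):

```python
class DataChecks:
    @staticmethod
    def _init_data_checks_from_params(data_check_classes, params):
        # Same logic as the module-level function, moved onto the class,
        # with a leading underscore to mark it private
        ...
```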
```python
@@ -5,20 +5,20 @@
from .label_leakage_data_check import LabelLeakageDataCheck
from .no_variance_data_check import NoVarianceDataCheck

_default_data_checks_classes = [HighlyNullDataCheck, IDColumnsDataCheck,
                                LabelLeakageDataCheck, InvalidTargetDataCheck, NoVarianceDataCheck]
```
Looks good. Let's define this as a class attribute of `DefaultDataChecks`? `_DEFAULT_DATA_CHECKS_CLASSES`
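That suggestion would look roughly like this (a sketch; the check list is taken from the diff above):

```python
class DefaultDataChecks(DataChecks):
    """A collection of basic data checks."""

    # Class attribute replacing the module-level list
    _DEFAULT_DATA_CHECKS_CLASSES = [HighlyNullDataCheck, IDColumnsDataCheck,
                                    LabelLeakageDataCheck, InvalidTargetDataCheck,
                                    NoVarianceDataCheck]
```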
```diff
@@ -12,6 +13,9 @@
 class InvalidTargetDataCheck(DataCheck):
     """Checks if the target labels contain missing or invalid data."""

+    def __init__(self, problem_type):
+        self.problem_type = handle_problem_types(problem_type)
```
Nice
([MockCheck], {"mock_check": {"fo": 3, "ba": 4}}, DataCheckInitError, | ||
r"Encountered the following error while initializing mock_check: __init__\(\) got an unexpected keyword argument 'fo'"), | ||
([MockCheck], {"MockCheck": {"foo": 2, "bar": 4}}, DataCheckInitError, | ||
"Class MockCheck was provided in params dictionary but it does not match any name in in the data_check_classes list."), |
These are great!

One more: what if `"mock_check"` is not provided in the dict? As opposed to also having extra entries like `"MockCheck"`.
Oh, and I think you check that the overall parameters come in as a dict, right? So we should test that.
Added a test for when the class is not provided in the dict! I think checking that the parameters come in as a dict is already covered by the test cases in that function!
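A hedged sketch of what that extra test might look like (`MockCheck` and its defaults are illustrative, mirroring the snippets above; the real fixture may differ):

```python
from evalml.data_checks import DataCheck, DataChecks

class MockCheck(DataCheck):
    name = "mock_check"

    def __init__(self, foo=1, bar=2):
        self.foo = foo
        self.bar = bar

    def validate(self, X, y=None):
        return []

def test_missing_key_falls_back_to_defaults():
    # "mock_check" is absent from the params dict, so init defaults apply
    checks = DataChecks([MockCheck], {})
    check = checks.data_checks[0]
    assert (check.foo, check.bar) == (1, 2)
```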
Force-pushed from eb5144c to fa96b4b.
```diff
         >>> threshold = 0.10
-        >>> target_check = ClassImbalanceDataCheck()
-        >>> assert target_check.validate(X, y, threshold) == [DataCheckWarning("The following labels fall below 10% of the target: [0]", "ClassImbalanceDataCheck")]
+        >>> target_check = ClassImbalanceDataCheck(threshold=0.10)
```
Looks good!
```python
leakage = [DataCheckWarning("Column 'has_label_leakage' is 95.0% or more correlated with the target", "LabelLeakageDataCheck")]

assert data_checks.validate(X, y) == messages[:3] + leakage + messages[3:]

data_checks = DataChecks(_default_data_checks_classes, {"InvalidTargetDataCheck": {"problem_type": "binary"}})
assert data_checks.validate(X, y) == messages[:3] + leakage + messages[3:]
```
oh this is super cool!
Force-pushed from 607e8c6 to 2d9a418.
@freddyaboulton I just responded to your comments. Looks good! Anything else you need in order to merge?
Force-pushed from 2d9a418 to 87e6d75.
…tionary. DefaultDataChecks is now a list of classes.
Force-pushed from 9ef2bfb to 5580fed.
Pull Request Description

Fixes #931 by:

- Adding a `problem_type` parameter to the `DefaultDataChecks` `__init__` method. This change is not visible to users of `AutoMLSearch` if they pass in `data_checks="auto"`.
- Modifying the `__init__` method of the `DataChecks` class. Users can now pass in a list of `DataCheck` instances, or a list of `DataCheck` classes and a params dict, similar to how we parametrize pipelines. This is backwards compatible with the "old way" of creating `DataChecks`.
- Raising a `DataCheckError` if there are not two unique values in a binary problem, which required modifying `InvalidTargetDataCheck`.

For `AutoMLSearch`, the API of the `search` method stays the same. If users want to pass parameters to the data checks, or use their own data check, they can pass in a list of `DataCheck` instances or a `DataChecks` class.

After creating the pull request: in order to pass the release_notes_updated check, you will need to update the "Future Release" section of `docs/source/release_notes.rst` to include this pull request by adding `:pr:123`.
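As an illustration of the new interface described above, here's a sketch of parametrizing checks via classes plus a params dict (import paths are assumptions based on the diffs in this PR):

```python
import pandas as pd
from evalml.data_checks import DataChecks, HighlyNullDataCheck, InvalidTargetDataCheck

# Toy binary-classification data
X = pd.DataFrame({"feature": [0.0, 1.0, 2.0, None]})
y = pd.Series([0, 1, 1, 0])

# A list of DataCheck classes plus a params dict keyed by check name,
# mirroring how pipelines are parametrized
data_checks = DataChecks(
    [HighlyNullDataCheck, InvalidTargetDataCheck],
    {"InvalidTargetDataCheck": {"problem_type": "binary"}},
)
messages = data_checks.validate(X, y)
```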