
Parametrizing DataChecks #1167

Merged: 13 commits into main from 931-parametrized-data-checks on Sep 28, 2020

Conversation

@freddyaboulton (Contributor) commented Sep 14, 2020

Pull Request Description

Fixes #931 by:

  • Adding a problem_type parameter to the DefaultDataChecks __init__ method. This change is not visible to users of AutoMLSearch if they pass in data_checks = "auto"
  • This was done by modifying the __init__ method of the DataChecks class. Users can now pass in a list of DataCheck instances, or a list of DataCheck classes and a params_dict, similar to how we parametrize pipelines. This is backwards compatible with the "old way" of creating DataChecks.
  • The original issue also asks for raising a DataCheckError when a binary problem's target does not have exactly two unique values, so I modified InvalidTargetDataCheck accordingly.

For AutoMLSearch, the API of the search method stays the same. If users want to pass parameters to the data checks, or use their own data checks, they can pass in a list of DataCheck instances or a DataChecks instance.
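To make the new construction style concrete, here is a minimal sketch of both forms, using the class and parameter names that appear in the diffs below (the toy data and surrounding usage are illustrative, and the data_check_params name is discussed for renaming later in the review):

import pandas as pd

from evalml.data_checks import ClassImbalanceDataCheck, DataChecks, InvalidTargetDataCheck

X = pd.DataFrame({"feature": [1, 2, 3, 4]})  # toy data for illustration
y = pd.Series([0, 0, 1, 1])

# New style: DataCheck classes plus a params dict keyed by check name,
# similar to how pipelines are parametrized.
checks = DataChecks(
    data_checks=[InvalidTargetDataCheck, ClassImbalanceDataCheck],
    data_check_params={
        "InvalidTargetDataCheck": {"problem_type": "binary"},
        "ClassImbalanceDataCheck": {"threshold": 0.10},
    },
)
messages = checks.validate(X, y)

# Old style still works: a list of already-constructed DataCheck instances.
checks = DataChecks(data_checks=[InvalidTargetDataCheck(problem_type="binary"),
                                 ClassImbalanceDataCheck(threshold=0.10)])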



codecov bot commented Sep 14, 2020

Codecov Report

Merging #1167 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1167   +/-   ##
=======================================
  Coverage   99.92%   99.92%           
=======================================
  Files         196      196           
  Lines       12029    12121   +92     
=======================================
+ Hits        12020    12112   +92     
  Misses          9        9           
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.58% <100.00%> (-0.01%) ⬇️
evalml/data_checks/__init__.py 100.00% <100.00%> (ø)
evalml/data_checks/class_imbalance_data_check.py 100.00% <100.00%> (ø)
evalml/data_checks/data_checks.py 100.00% <100.00%> (ø)
evalml/data_checks/default_data_checks.py 100.00% <100.00%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.00% <100.00%> (ø)
evalml/exceptions/exceptions.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)
...ta_checks_tests/test_class_imbalance_data_check.py 100.00% <100.00%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.00% <100.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6bf1998...5580fed.

@@ -346,10 +345,11 @@ def _handle_keyboard_interrupt(self, pipeline, current_batch_pipelines):
        else:
            leading_char = ""

-    def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_plot=True):
+    def search(self, X, y, data_checks="auto", data_check_params=None, show_iteration_plot=True, feature_types=None):
Contributor:
I'm skeptical this is the best solution. Isn't the point of passing your own argument to data_checks that you can set the params?

This solution feels weird because it requires the person to know which data checks are being run. If you know that because you set data_checks yourself, you should also be able to set the params there.

@freddyaboulton (author):

You're right that if you're specifying all the params via a dict, you can just create the instances yourself. I removed data_check_params from the search API.
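For instance, under the resolved design, a user who wants non-default parameters builds the instances directly. A hypothetical sketch (the AutoMLSearch construction is abbreviated, and the toy data is illustrative):

import pandas as pd

from evalml.automl import AutoMLSearch
from evalml.data_checks import ClassImbalanceDataCheck, InvalidTargetDataCheck

X = pd.DataFrame({"feature": [1, 2, 3, 4]})
y = pd.Series([0, 0, 1, 1])

automl = AutoMLSearch(problem_type="binary")
# Configure the checks yourself; no separate data_check_params argument needed.
automl.search(X, y, data_checks=[InvalidTargetDataCheck(problem_type="binary"),
                                 ClassImbalanceDataCheck(threshold=0.05)])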

@@ -7,37 +7,42 @@
class ClassImbalanceDataCheck(DataCheck):
    """Classification problems, checks if any target labels are imbalanced beyond a threshold"""

-    def validate(self, X, y, threshold=0.10):
+    def __init__(self, threshold=0.1):
@freddyaboulton (author):

Had to move the threshold param to __init__ to match the pattern we use for the other data checks; this will make it easier to include this data check within automl in the future.

@dsherry (Contributor) left a comment:

@freddyaboulton this rocks!

Before merging, let's resolve the conversations I left in DataChecks about the params name and the validation/init code in __init__. I also left a question in InvalidTargetDataCheck which we should close off. And a couple missing unit test cases.

I also think we should add something to the user guide. I see we don't currently discuss DataChecks at all! We don't have to do it in this PR; we can file a separate issue.


class DataChecks:
    """A collection of data checks."""

-    def __init__(self, data_checks=None):
+    def __init__(self, data_checks=None, data_check_params=None):
Contributor:

Could we call this parameters?

        else:
            raise ValueError("All elements of parameter data_checks must be an instance of DataCheck "
                             "or a DataCheck class with any desired parameters specified in the "
                             "data_check_params dictionary.")
Contributor:

@freddyaboulton got it, so this is the backwards compatibility mechanism. It's great that you thought of this.

We're still in a position where it's OK for us to make backwards-incompatible changes to the data checks API. I'd be in favor of deleting the backwards compatibility code and simplifying so that DataChecks requires a list of classes, not instances. What do you think of that?

@freddyaboulton (author):

@dsherry I am happy to change DataChecks so that it requires a list of classes and not instances! The one caveat is that I believe we still want to let users pass a list of DataCheck instances to AutoMLSearch.search (to let them modify the behavior of the default checks or pass their own data checks). That means we'll still need a mechanism to go from List[DataCheck] to a DataChecks instance, but that can happen in another class (maybe AutoMLDataChecks, a subclass of DataChecks with a different __init__).

Let me know what you think of this plan!
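Here is a hypothetical sketch of that AutoMLDataChecks idea: a subclass whose __init__ accepts ready-made instances, assuming DataChecks itself would then only accept classes. This illustrates the proposal, not the merged code:

from evalml.data_checks import DataCheck, DataChecks

class AutoMLDataChecks(DataChecks):
    """Wraps a user-supplied List[DataCheck] so AutoMLSearch can treat it as DataChecks."""

    def __init__(self, data_check_instances):
        # Bypass the class/parameters machinery and just verify we got DataCheck instances.
        if not all(isinstance(check, DataCheck) for check in data_check_instances):
            raise ValueError("All elements of data_check_instances must be DataCheck instances.")
        self.data_checks = data_check_instances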

Contributor:

Sounds good! This feels similar to what we've done with components. We've added make_pipeline_from_components which can take in a list of components and return a pipeline instance. Feels like we want the same thing here.

            data_check_instances.append(data_check_class(**class_params))
        except TypeError as e:
            raise DataCheckInitError(f"Encountered the following error while initializing {data_check_class.name}: {e}")
    return data_check_instances
Contributor:

I like the validation code you have here, but let's define it in a separate method and call it at the start of __init__, yeah?

After that, we're left with the following code:

data_check_instances = []
for data_check_class in data_check_classes:
    data_check_instances.append(data_check_class(**parameters.get(data_check_class.name, {})))
return data_check_instances

(Could do in one line but maybe the expanded form makes debugging easier)

Perhaps we could just paste that into __init__ since it's not much?

@freddyaboulton (author):

I still think we'll need a try/except inside the for-loop to raise a DataCheckInitError if the user specifies erroneous arguments for a given DataCheck class.

I can split the init into two methods:

self._validate_data_check_classes(data_checks, params)
self.data_checks = self._init_data_checks(data_checks, params)

What do you think of that? Let me know if I misunderstood your suggestion.
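Concretely, the loop with that try/except might look like the following sketch, assembled from the diff snippets above (data_check_classes, parameters, and DataCheckInitError are the names assumed elsewhere in this PR):

data_check_instances = []
for data_check_class in data_check_classes:
    class_params = parameters.get(data_check_class.name, {})
    try:
        # Delegate construction; bad keyword arguments surface as TypeError from __init__.
        data_check_instances.append(data_check_class(**class_params))
    except TypeError as e:
        raise DataCheckInitError(f"Encountered the following error while initializing {data_check_class.name}: {e}")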

Contributor:

@freddyaboulton but can't each DataCheck's __init__ be responsible for throwing if the params aren't right? I.e., we delegate to the data checks.

Either way, it's just a design question; fine to merge with what you've got.

@@ -35,3 +45,25 @@ def validate(self, X, y=None):
messages_new = data_check.validate(X, y)
messages.extend(messages_new)
return messages


def init_data_checks_from_params(data_check_classes, params):
Contributor:

@staticmethod, with underscore prefix?

@@ -5,20 +5,20 @@
from .label_leakage_data_check import LabelLeakageDataCheck
from .no_variance_data_check import NoVarianceDataCheck

_default_data_checks_classes = [HighlyNullDataCheck, IDColumnsDataCheck,
                                LabelLeakageDataCheck, InvalidTargetDataCheck, NoVarianceDataCheck]
Contributor:

Looks good. Let's define this as a class attribute of DefaultDataChecks? _DEFAULT_DATA_CHECKS_CLASSES
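For illustration, that suggestion would look roughly like this sketch, assuming DefaultDataChecks forwards its problem_type to InvalidTargetDataCheck through the params dict as described in the PR description:

from evalml.data_checks import (DataChecks, HighlyNullDataCheck, IDColumnsDataCheck,
                                InvalidTargetDataCheck, LabelLeakageDataCheck,
                                NoVarianceDataCheck)

class DefaultDataChecks(DataChecks):
    """The default set of data checks, parametrized by problem type."""

    _DEFAULT_DATA_CHECKS_CLASSES = [HighlyNullDataCheck, IDColumnsDataCheck,
                                    LabelLeakageDataCheck, InvalidTargetDataCheck,
                                    NoVarianceDataCheck]

    def __init__(self, problem_type):
        super().__init__(self._DEFAULT_DATA_CHECKS_CLASSES,
                         {"InvalidTargetDataCheck": {"problem_type": problem_type}})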

@@ -12,6 +13,9 @@
class InvalidTargetDataCheck(DataCheck):
"""Checks if the target labels contain missing or invalid data."""

    def __init__(self, problem_type):
        self.problem_type = handle_problem_types(problem_type)
Contributor:

Nice
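For context, the binary-target rule from the PR description (raise a DataCheckError when a binary problem's target does not have exactly two unique values) could be implemented along these lines; this is a hypothetical illustration, since the merged validate() body is not shown in this thread:

from evalml.data_checks import DataCheck, DataCheckError
from evalml.problem_types import ProblemTypes, handle_problem_types

class InvalidTargetDataCheck(DataCheck):
    """Checks if the target labels contain missing or invalid data."""

    def __init__(self, problem_type):
        self.problem_type = handle_problem_types(problem_type)

    def validate(self, X, y):
        messages = []
        # Binary problems must have exactly two unique target values.
        if self.problem_type == ProblemTypes.BINARY and y.nunique(dropna=True) != 2:
            messages.append(DataCheckError(
                f"Target has {y.nunique(dropna=True)} unique value(s); binary problems require exactly 2.",
                self.name))
        return messages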

(Resolved thread on evalml/exceptions/exceptions.py; outdated)
([MockCheck], {"mock_check": {"fo": 3, "ba": 4}}, DataCheckInitError,
r"Encountered the following error while initializing mock_check: __init__\(\) got an unexpected keyword argument 'fo'"),
([MockCheck], {"MockCheck": {"foo": 2, "bar": 4}}, DataCheckInitError,
"Class MockCheck was provided in params dictionary but it does not match any name in in the data_check_classes list."),
Contributor:

These are great!

One more: what if "mock_check" is not provided in the dict, as opposed to having extra entries like "MockCheck"?

Contributor:

Oh, and I think you check that the overall parameters come in as a dict, right? So we should test that too.

@freddyaboulton (author):

Added a test for when the class is not provided in the dict! I think verifying that the parameters come in as a dict is already covered by the test cases in that function.

>>> threshold = 0.10
>>> target_check = ClassImbalanceDataCheck()
>>> assert target_check.validate(X, y, threshold) == [DataCheckWarning("The following labels fall below 10% of the target: [0]", "ClassImbalanceDataCheck")]
>>> target_check = ClassImbalanceDataCheck(threshold=0.10)
Contributor:

Looks good!

(Resolved thread on evalml/data_checks/data_checks.py; outdated)

leakage = [DataCheckWarning("Column 'has_label_leakage' is 95.0% or more correlated with the target", "LabelLeakageDataCheck")]

assert data_checks.validate(X, y) == messages[:3] + leakage + messages[3:]

data_checks = DataChecks(_default_data_checks_classes, {"InvalidTargetDataCheck": {"problem_type": "binary"}})
assert data_checks.validate(X, y) == messages[:3] + leakage + messages[3:]
Contributor:

oh this is super cool!

@freddyaboulton force-pushed the 931-parametrized-data-checks branch 3 times, most recently from 607e8c6 to 2d9a418 on September 23, 2020
@dsherry (Contributor) commented Sep 24, 2020

@freddyaboulton I just responded to your comments. Looks good! Anything else you need in order to merge?

Linked issue #931: Support parameterization of data checks; have InvalidTargetDataCheck validate target using problem_type