
Parametrizing DataChecks #1167

Merged: 13 commits into main from 931-parametrized-data-checks on Sep 28, 2020

Conversation

@freddyaboulton (Contributor) commented Sep 14, 2020

Pull Request Description

Fixes #931 by:

  • Adding a problem_type parameter to the DefaultDataChecks __init__ method. This change is not visible to users of AutoMLSearch if they pass in data_checks = "auto"
  • This was done by modifying the __init__ method of the DataChecks class. Users can now pass in a list of DataCheck instances, or a list of DataCheck classes and a params_dict, similar to how we parametrize pipelines. This is backwards compatible with the "old way" of creating DataChecks.
  • The original issue also asks for raising a DataCheckError when a binary problem's target does not have exactly two unique values, so I modified InvalidTargetDataCheck accordingly.

For AutoMLSearch, the API of the search method stays the same. If users want to pass parameters to the data checks, or use their own data checks, they can pass in a list of DataCheck instances or a DataChecks instance.
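To make the new construction style concrete, here is a minimal sketch of both forms, using the class and parameter names that appear in the diffs below (the toy data and surrounding usage are illustrative, and the data_check_params name is discussed for renaming later in the review):

import pandas as pd

from evalml.data_checks import ClassImbalanceDataCheck, DataChecks, InvalidTargetDataCheck

X = pd.DataFrame({"feature": [1, 2, 3, 4]})  # toy data for illustration
y = pd.Series([0, 0, 1, 1])

# New style: DataCheck classes plus a params dict keyed by check name,
# similar to how pipelines are parametrized.
checks = DataChecks(
    data_checks=[InvalidTargetDataCheck, ClassImbalanceDataCheck],
    data_check_params={
        "InvalidTargetDataCheck": {"problem_type": "binary"},
        "ClassImbalanceDataCheck": {"threshold": 0.10},
    },
)
messages = checks.validate(X, y)

# Old style still works: a list of already-constructed DataCheck instances.
checks = DataChecks(data_checks=[InvalidTargetDataCheck(problem_type="binary"),
                                 ClassImbalanceDataCheck(threshold=0.10)])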



codecov bot commented Sep 14, 2020

Codecov Report

Merging #1167 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1167   +/-   ##
=======================================
  Coverage   99.92%   99.92%           
=======================================
  Files         196      196           
  Lines       12029    12121   +92     
=======================================
+ Hits        12020    12112   +92     
  Misses          9        9           
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.58% <100.00%> (-0.01%) ⬇️
evalml/data_checks/__init__.py 100.00% <100.00%> (ø)
evalml/data_checks/class_imbalance_data_check.py 100.00% <100.00%> (ø)
evalml/data_checks/data_checks.py 100.00% <100.00%> (ø)
evalml/data_checks/default_data_checks.py 100.00% <100.00%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.00% <100.00%> (ø)
evalml/exceptions/exceptions.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)
...ta_checks_tests/test_class_imbalance_data_check.py 100.00% <100.00%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.00% <100.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6bf1998...5580fed.

@@ -346,10 +345,11 @@ def _handle_keyboard_interrupt(self, pipeline, current_batch_pipelines):
        else:
            leading_char = ""

-    def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_plot=True):
+    def search(self, X, y, data_checks="auto", data_check_params=None, show_iteration_plot=True, feature_types=None):
Contributor:
I'm skeptical this is the best solution. Isn't the point of passing your own argument to data_checks that you can set the params?

This solution feels weird because it requires the person to know which data checks are being run. If you know that because you set data_checks yourself, you should also be able to set the params there.

@freddyaboulton (author):

You're right that if you're specifying all the params via a dict, you can just create the instances yourself. I removed data_check_params from the search API.
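For instance, under the resolved design, a user who wants non-default parameters builds the instances directly. A hypothetical sketch (the AutoMLSearch construction is abbreviated, and the toy data is illustrative):

import pandas as pd

from evalml.automl import AutoMLSearch
from evalml.data_checks import ClassImbalanceDataCheck, InvalidTargetDataCheck

X = pd.DataFrame({"feature": [1, 2, 3, 4]})
y = pd.Series([0, 0, 1, 1])

automl = AutoMLSearch(problem_type="binary")
# Configure the checks yourself; no separate data_check_params argument needed.
automl.search(X, y, data_checks=[InvalidTargetDataCheck(problem_type="binary"),
                                 ClassImbalanceDataCheck(threshold=0.05)])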

@@ -7,37 +7,42 @@
class ClassImbalanceDataCheck(DataCheck):
    """Classification problems, checks if any target labels are imbalanced beyond a threshold"""

-    def validate(self, X, y, threshold=0.10):
+    def __init__(self, threshold=0.1):
@freddyaboulton (author):

Had to move the threshold param to __init__ to match the pattern we use for the other data checks; this will make it easier to include this data check within automl in the future.

@dsherry (Contributor) left a comment:

@freddyaboulton this rocks!

Before merging, let's resolve the conversations I left in DataChecks about the params name and the validation/init code in __init__. I also left a question in InvalidTargetDataCheck which we should close off. And a couple missing unit test cases.

I also think we should add something to the user guide. I see we don't currently discuss DataChecks at all! We don't have to do it in this PR; we can file a separate issue.


class DataChecks:
    """A collection of data checks."""

-    def __init__(self, data_checks=None):
+    def __init__(self, data_checks=None, data_check_params=None):
Contributor:

Could we call this parameters?

        else:
            raise ValueError("All elements of parameter data_checks must be an instance of DataCheck "
                             "or a DataCheck class with any desired parameters specified in the "
                             "data_check_params dictionary.")
Contributor:

@freddyaboulton got it, so this is the backwards compatibility mechanism. It's great that you thought of this.

We're still in a position where it's OK for us to make backwards-incompatible changes to the data checks API. I'd be in favor of deleting the backwards compatibility code and simplifying so that DataChecks requires a list of classes, not instances. What do you think of that?

@freddyaboulton (author):

@dsherry I am happy to change DataChecks so that it requires a list of classes and not instances! The one caveat is that I believe we still want to let users pass a list of DataCheck instances to AutoMLSearch.search (to let them modify the behavior of the default checks or pass their own data checks). That means we'll still need a mechanism to go from List[DataCheck] to a DataChecks instance, but that can happen in another class (maybe AutoMLDataChecks, a subclass of DataChecks with a different __init__).

Let me know what you think of this plan!
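Here is a hypothetical sketch of that AutoMLDataChecks idea: a subclass whose __init__ accepts ready-made instances, assuming DataChecks itself would then only accept classes. This illustrates the proposal, not the merged code:

from evalml.data_checks import DataCheck, DataChecks

class AutoMLDataChecks(DataChecks):
    """Wraps a user-supplied List[DataCheck] so AutoMLSearch can treat it as DataChecks."""

    def __init__(self, data_check_instances):
        # Bypass the class/parameters machinery and just verify we got DataCheck instances.
        if not all(isinstance(check, DataCheck) for check in data_check_instances):
            raise ValueError("All elements of data_check_instances must be DataCheck instances.")
        self.data_checks = data_check_instances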

Contributor:

Sounds good! This feels similar to what we've done with components. We've added make_pipeline_from_components which can take in a list of components and return a pipeline instance. Feels like we want the same thing here.

            data_check_instances.append(data_check_class(**class_params))
        except TypeError as e:
            raise DataCheckInitError(f"Encountered the following error while initializing {data_check_class.name}: {e}")
    return data_check_instances
Contributor:

I like the validation code you have here, but let's define it in a separate method and call it at the start of __init__, yeah?

After that, we're left with the following code:

data_check_instances = []
for data_check_class in data_check_classes:
    data_check_instances.append(data_check_class(**parameters.get(data_check_class.name, {})))
return data_check_instances

(Could do in one line but maybe the expanded form makes debugging easier)

Perhaps we could just paste that into __init__ since it's not much?

@freddyaboulton (author):

I still think we'll need a try/except inside the for-loop to raise a DataCheckInitError if the user specifies erroneous arguments for a given DataCheck class.

I can split the init into two methods:

self._validate_data_check_classes(data_checks, params)
self.data_checks = self._init_data_checks(data_checks, params)

What do you think of that? Let me know if I misunderstood your suggestion.
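Concretely, the loop with that try/except might look like the following sketch, assembled from the diff snippets above (data_check_classes, parameters, and DataCheckInitError are the names assumed elsewhere in this PR):

data_check_instances = []
for data_check_class in data_check_classes:
    class_params = parameters.get(data_check_class.name, {})
    try:
        # Delegate construction; bad keyword arguments surface as TypeError from __init__.
        data_check_instances.append(data_check_class(**class_params))
    except TypeError as e:
        raise DataCheckInitError(f"Encountered the following error while initializing {data_check_class.name}: {e}")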

Contributor:

@freddyaboulton but can't each DataCheck's __init__ be responsible for throwing if the params aren't right? I.e., we delegate to the data checks.

Either way, it's just a design question; fine to merge with what you've got.

@@ -35,3 +45,25 @@ def validate(self, X, y=None):
messages_new = data_check.validate(X, y)
messages.extend(messages_new)
return messages


def init_data_checks_from_params(data_check_classes, params):
Contributor:

@staticmethod, with underscore prefix?

@@ -5,20 +5,20 @@
from .label_leakage_data_check import LabelLeakageDataCheck
from .no_variance_data_check import NoVarianceDataCheck

_default_data_checks_classes = [HighlyNullDataCheck, IDColumnsDataCheck,
                                LabelLeakageDataCheck, InvalidTargetDataCheck, NoVarianceDataCheck]
Contributor:

Looks good. Let's define this as a class attribute of DefaultDataChecks? _DEFAULT_DATA_CHECKS_CLASSES
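For illustration, that suggestion would look roughly like this sketch, assuming DefaultDataChecks forwards its problem_type to InvalidTargetDataCheck through the params dict as described in the PR description:

from evalml.data_checks import (DataChecks, HighlyNullDataCheck, IDColumnsDataCheck,
                                InvalidTargetDataCheck, LabelLeakageDataCheck,
                                NoVarianceDataCheck)

class DefaultDataChecks(DataChecks):
    """The default set of data checks, parametrized by problem type."""

    _DEFAULT_DATA_CHECKS_CLASSES = [HighlyNullDataCheck, IDColumnsDataCheck,
                                    LabelLeakageDataCheck, InvalidTargetDataCheck,
                                    NoVarianceDataCheck]

    def __init__(self, problem_type):
        super().__init__(self._DEFAULT_DATA_CHECKS_CLASSES,
                         {"InvalidTargetDataCheck": {"problem_type": problem_type}})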

@@ -12,6 +13,9 @@
class InvalidTargetDataCheck(DataCheck):
"""Checks if the target labels contain missing or invalid data."""

    def __init__(self, problem_type):
        self.problem_type = handle_problem_types(problem_type)
Contributor:

Nice
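For context, the binary-target rule from the PR description (raise a DataCheckError when a binary problem's target does not have exactly two unique values) could be implemented along these lines; this is a hypothetical illustration, since the merged validate() body is not shown in this thread:

from evalml.data_checks import DataCheck, DataCheckError
from evalml.problem_types import ProblemTypes, handle_problem_types

class InvalidTargetDataCheck(DataCheck):
    """Checks if the target labels contain missing or invalid data."""

    def __init__(self, problem_type):
        self.problem_type = handle_problem_types(problem_type)

    def validate(self, X, y):
        messages = []
        # Binary problems must have exactly two unique target values.
        if self.problem_type == ProblemTypes.BINARY and y.nunique(dropna=True) != 2:
            messages.append(DataCheckError(
                f"Target has {y.nunique(dropna=True)} unique value(s); binary problems require exactly 2.",
                self.name))
        return messages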

(Resolved thread on evalml/exceptions/exceptions.py; outdated)
([MockCheck], {"mock_check": {"fo": 3, "ba": 4}}, DataCheckInitError,
r"Encountered the following error while initializing mock_check: __init__\(\) got an unexpected keyword argument 'fo'"),
([MockCheck], {"MockCheck": {"foo": 2, "bar": 4}}, DataCheckInitError,
"Class MockCheck was provided in params dictionary but it does not match any name in in the data_check_classes list."),
Contributor:

These are great!

One more: what if "mock_check" is not provided in the dict, as opposed to having extra entries like "MockCheck"?

Contributor:

Oh, and I think you check that the overall parameters come in as a dict, right? So we should test that too.

@freddyaboulton (author):

Added a test for when the class is not provided in the dict! I think verifying that the parameters come in as a dict is already covered by the test cases in that function.

>>> threshold = 0.10
>>> target_check = ClassImbalanceDataCheck()
>>> assert target_check.validate(X, y, threshold) == [DataCheckWarning("The following labels fall below 10% of the target: [0]", "ClassImbalanceDataCheck")]
>>> target_check = ClassImbalanceDataCheck(threshold=0.10)
Contributor:

Looks good!

(Resolved thread on evalml/data_checks/data_checks.py; outdated)

leakage = [DataCheckWarning("Column 'has_label_leakage' is 95.0% or more correlated with the target", "LabelLeakageDataCheck")]

assert data_checks.validate(X, y) == messages[:3] + leakage + messages[3:]

data_checks = DataChecks(_default_data_checks_classes, {"InvalidTargetDataCheck": {"problem_type": "binary"}})
assert data_checks.validate(X, y) == messages[:3] + leakage + messages[3:]
Contributor:

oh this is super cool!

@freddyaboulton force-pushed the 931-parametrized-data-checks branch 3 times, most recently from 607e8c6 to 2d9a418 on September 23, 2020
@dsherry (Contributor) commented Sep 24, 2020

@freddyaboulton I just responded to your comments. Looks good! Anything else you need in order to merge?

Linked issue #931: Support parameterization of data checks; have InvalidTargetDataCheck validate target using problem_type