
Support parameterization of data checks; have InvalidTargetDataCheck validate target using problem_type #931

Closed
angela97lin opened this issue Jul 15, 2020 · 6 comments · Fixed by #1167
Labels
enhancement An improvement to an existing feature.

Comments


angela97lin commented Jul 15, 2020

Fixes #970.

Per the discussion with @freddyaboulton in #929, it would be nice if we could pass extra information along to DataChecks. This would require updating the DataCheck API and considering how it interacts with AutoML, since we never instantiate DataChecks ourselves and only pass a DataChecks class as a parameter to search().


dsherry commented Jul 22, 2020

@angela97lin could you please describe the use-case for this?

angela97lin commented

@dsherry Sure! In #929, @freddyaboulton and I were discussing how the InvalidTargetDataCheck could be even more useful if it were aware of the type of problem it was handling. For example, if we knew our problem was binary classification but the input to the data check had more than two classes, we could raise a warning or error. Hence, it'd be nice to be able to pass parameters, or just more information, to the data check. Unfortunately, this doesn't work well with our current design, where we pass around classes rather than instances.

Alternatively, we could create data check classes for each problem type, such as BinaryClassificationInvalidTargetDataCheck, but this could get pretty hairy too when determining what DefaultDataChecks should include (or should that, too, be broken down into DefaultBinaryClassificationDataChecks?).


dsherry commented Jul 23, 2020

Just discussed with @angela97lin @freddyaboulton

We like the idea of mirroring the pattern we use for component_graph in pipelines:

  • The list of data checks can be specified initially to automl search as a list of DataCheck subclasses (or same but inside DataChecks), not instances
  • Once automl search wants to run the data checks, it can create an instance of the DataChecks class
  • At that point we'd pass it a data_check_parameters dict, similar to our pipeline parameters, which contains optional configuration for one or more data checks.
  • If users want to use DataChecks directly they can follow a similar pattern
  • data_check_parameters should default to None so people don't need to create it if it's not required. But if a required arg is missing from a data check (like problem_type for some), that should result in an initialization error

Here's a sketch of how this could look in automl search:

# today this helper standardizes the input to a list of `DataCheck` instances, and wraps that in a `DataChecks` instance
# after this work, this would standardize the input to a `DataChecks` class.
# if `data_checks` is already a `DataChecks` class, do nothing. else if `data_checks` is a list of `DataCheck` classes, define an `AutoMLDataChecks` class to wrap and return that
data_checks_class = self._validate_data_checks(data_checks)
# next we create the `DataChecks` instance by passing in data checks parameters
data_check_parameters = {'Target Datatype Data Check': {'problem_type': self.problem_type}}
data_checks = data_checks_class(data_check_parameters)
data_check_results = data_checks.validate(X, y)

Direct usage would look similar.
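For concreteness, here is a minimal sketch of that direct-usage pattern in plain Python. The class names (MockDataCheck), the keying of data_check_parameters by class name, and the constructor signatures are all assumptions for illustration, not EvalML's actual API:

```python
class MockDataCheck:
    """Hypothetical stand-in for a DataCheck subclass with one parameter."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def validate(self, X, y):
        return [f"validated with threshold={self.threshold}"]


class DataChecks:
    """Wraps DataCheck classes and instantiates them with per-check parameters."""

    def __init__(self, data_check_classes, data_check_parameters=None):
        # data_check_parameters defaults to None so callers with no
        # configuration don't need to build an empty dict
        parameters = data_check_parameters or {}
        self.data_checks = [
            cls(**parameters.get(cls.__name__, {}))
            for cls in data_check_classes
        ]

    def validate(self, X, y):
        results = []
        for check in self.data_checks:
            results.extend(check.validate(X, y))
        return results


# Direct usage mirrors the automl sketch: classes in, parameters keyed by name
checks = DataChecks([MockDataCheck], {"MockDataCheck": {"threshold": 0.9}})
print(checks.validate(None, None))  # ['validated with threshold=0.9']
```

A missing required constructor arg would surface naturally as a TypeError at instantiation time, which matches the "initialization error" behavior described above.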

Next steps

@dsherry dsherry added this to the August 2020 milestone Jul 23, 2020
freddyaboulton commented

@dsherry The plan looks good to me! The only thing I would add is that I'd prefer to augment the existing InvalidTargetDataCheck rather than create a new data check, but either approach works for me. Whoever picks this up, please make sure to check that the target has only two unique values when the problem_type is binary. This was mentioned in the review for #929.

import warnings

if problem_type == "binary" and len(set(y)) != 2:
    # warn that the target does not contain exactly two unique values
    warnings.warn(f"Binary problem type, but y has {len(set(y))} unique values")


dsherry commented Jul 23, 2020

You know what, @angela97lin @freddyaboulton let's use this issue to track both a) updating automl and the data checks API to support parameterization and b) updating InvalidTargetDataCheck to validate the target and raise intelligent errors, for all the target types we support.

Mentioning this because I just filed bug #970, and on closer look, that issue would be fixed by the above. So this will close #970.
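Part (b) could look roughly like the following. This is only a sketch under assumed names: the constructor signature, the "binary"/"multiclass"/"regression" strings, and the returned message list are illustrative, not EvalML's actual implementation.

```python
class InvalidTargetDataCheck:
    """Hypothetical sketch: validate the target column against a problem type."""

    def __init__(self, problem_type):
        # problem_type is required, so a bad or missing value errors at init
        if problem_type not in ("binary", "multiclass", "regression"):
            raise ValueError(f"Unknown problem type: {problem_type}")
        self.problem_type = problem_type

    def validate(self, X, y):
        messages = []
        unique = {value for value in y if value is not None}
        if self.problem_type == "binary" and len(unique) != 2:
            messages.append(
                f"Binary targets need 2 unique values; found {len(unique)}"
            )
        elif self.problem_type == "multiclass" and len(unique) < 3:
            messages.append(
                f"Multiclass targets need 3+ unique values; found {len(unique)}"
            )
        elif self.problem_type == "regression" and not all(
            isinstance(value, (int, float)) for value in unique
        ):
            messages.append("Regression targets must be numeric")
        return messages


check = InvalidTargetDataCheck("binary")
print(check.validate(None, [0, 1, 1, 2]))
# ['Binary targets need 2 unique values; found 3']
```

This covers the binary case freddyaboulton called out above, plus placeholder rules for the other problem types; the real per-type rules would come from whatever target validation EvalML decides to support.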

@dsherry dsherry changed the title Augment data checks to include extra parameters (information) Update data checks to support parameterization. Update InvalidTargetDataCheck to validate target based on problem_type Jul 23, 2020
@dsherry dsherry changed the title Update data checks to support parameterization. Update InvalidTargetDataCheck to validate target based on problem_type Support parameterization of data checks; have InvalidTargetDataCheck validate target using problem_type Jul 23, 2020
angela97lin commented

@dsherry How timely! That sounds good to me 😊
