
Support parameterization of data checks; have InvalidTargetDataCheck validate target using problem_type #931

Closed
angela97lin opened this issue Jul 15, 2020 · 6 comments · Fixed by #1167
Labels
enhancement An improvement to an existing feature.

Comments


angela97lin commented Jul 15, 2020

Fixes #970.

Per the discussion with @freddyaboulton in #929, it would be nice if we could pass extra information along to DataChecks. This would require updating the DataCheck API and considering how it interacts with AutoML, since we never instantiate DataChecks ourselves and only pass a DataChecks class as a parameter to search().


dsherry commented Jul 22, 2020

@angela97lin could you please describe the use-case for this?

angela97lin commented

@dsherry Sure! In #929, @freddyaboulton and I were discussing how the InvalidTargetDataCheck could be even more useful if it were aware of the type of problem it was handling. For example, if we knew our problem was binary classification but the input to the data check had more than two classes, we could raise a warning or error. Hence, it'd be nice to be able to pass parameters, or just more information, to the data check. Unfortunately, this doesn't work well with our current design, where we pass around classes rather than instances.

Alternatively, we could create data check classes for each problem type, such as BinaryClassificationInvalidTargetDataCheck, but this could get pretty hairy too when determining what DefaultDataChecks should include (or should that, too, be broken down into DefaultBinaryClassificationDataChecks?).


dsherry commented Jul 23, 2020

Just discussed with @angela97lin @freddyaboulton

We like the idea of mirroring the pattern we use for component_graph in pipelines:

  • The list of data checks can be specified initially to automl search as a list of DataCheck subclasses (or same but inside DataChecks), not instances
  • Once automl search wants to run the data checks, it can create an instance of the DataChecks class
  • At that point we'd pass it a data_check_parameters dict, similar to our pipeline parameters, which contains optional configuration for one or more data checks.
  • If users want to use DataChecks directly they can follow a similar pattern
  • data_check_parameters should default to None so people don't need to create it if it's not required. But if a required arg is missing from a data check (like problem_type for some), that should result in an initialization error

Here's a sketch of how this could look in automl search:

# today this helper standardizes the input to a list of `DataCheck` instances, and wraps that in a `DataChecks` instance
# after this work, this would standardize the input to a `DataChecks` class.
# if `data_checks` is already a `DataChecks` class, do nothing. else if `data_checks` is a list of `DataCheck` classes, define an `AutoMLDataChecks` class to wrap and return that
data_checks_class = self._validate_data_checks(data_checks)
# next we create the `DataChecks` instance by passing in data checks parameters
data_check_parameters = {'Target Datatype Data Check': {'problem_type': self.problem_type}}
data_checks = data_checks_class(data_check_parameters)
data_check_results = data_checks.validate(X, y)

Direct usage would look similar.
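For concreteness, here is a minimal sketch of that direct-usage pattern in plain Python. The class names (MockDataCheck), the keying of data_check_parameters by class name, and the constructor signatures are all assumptions for illustration, not EvalML's actual API:

```python
class MockDataCheck:
    """Hypothetical stand-in for a DataCheck subclass with one parameter."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def validate(self, X, y):
        return [f"validated with threshold={self.threshold}"]


class DataChecks:
    """Wraps DataCheck classes and instantiates them with per-check parameters."""

    def __init__(self, data_check_classes, data_check_parameters=None):
        # data_check_parameters defaults to None so callers with no
        # configuration don't need to build an empty dict
        parameters = data_check_parameters or {}
        self.data_checks = [
            cls(**parameters.get(cls.__name__, {}))
            for cls in data_check_classes
        ]

    def validate(self, X, y):
        results = []
        for check in self.data_checks:
            results.extend(check.validate(X, y))
        return results


# Direct usage mirrors the automl sketch: classes in, parameters keyed by name
checks = DataChecks([MockDataCheck], {"MockDataCheck": {"threshold": 0.9}})
print(checks.validate(None, None))  # ['validated with threshold=0.9']
```

A missing required constructor arg would surface naturally as a TypeError at instantiation time, which matches the "initialization error" behavior described above.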

Next steps

@dsherry dsherry added this to the August 2020 milestone Jul 23, 2020
freddyaboulton commented

@dsherry The plan looks good to me! The only thing I would add is that I'd prefer to augment the existing InvalidTargetDataCheck rather than create a new data check, but either approach works for me. Whoever picks this up, please make sure to check that the target has only two unique values when the problem_type is binary. This was mentioned in the review for #929.

import warnings

if problem_type == "binary" and len(set(y)) != 2:
    # warn that the target does not contain exactly two unique values
    warnings.warn(f"Binary problem type, but y has {len(set(y))} unique values")


dsherry commented Jul 23, 2020

You know what, @angela97lin @freddyaboulton let's use this issue to track both a) updating automl and the data checks API to support parameterization and b) updating InvalidTargetDataCheck to validate the target and raise intelligent errors, for all the target types we support.

Mentioning this because I just filed bug #970, and on closer look, that issue would be fixed by the above. So this will close #970.
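Part (b) could look roughly like the following. This is only a sketch under assumed names: the constructor signature, the "binary"/"multiclass"/"regression" strings, and the returned message list are illustrative, not EvalML's actual implementation.

```python
class InvalidTargetDataCheck:
    """Hypothetical sketch: validate the target column against a problem type."""

    def __init__(self, problem_type):
        # problem_type is required, so a bad or missing value errors at init
        if problem_type not in ("binary", "multiclass", "regression"):
            raise ValueError(f"Unknown problem type: {problem_type}")
        self.problem_type = problem_type

    def validate(self, X, y):
        messages = []
        unique = {value for value in y if value is not None}
        if self.problem_type == "binary" and len(unique) != 2:
            messages.append(
                f"Binary targets need 2 unique values; found {len(unique)}"
            )
        elif self.problem_type == "multiclass" and len(unique) < 3:
            messages.append(
                f"Multiclass targets need 3+ unique values; found {len(unique)}"
            )
        elif self.problem_type == "regression" and not all(
            isinstance(value, (int, float)) for value in unique
        ):
            messages.append("Regression targets must be numeric")
        return messages


check = InvalidTargetDataCheck("binary")
print(check.validate(None, [0, 1, 1, 2]))
# ['Binary targets need 2 unique values; found 3']
```

This covers the binary case freddyaboulton called out above, plus placeholder rules for the other problem types; the real per-type rules would come from whatever target validation EvalML decides to support.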

@dsherry dsherry changed the title Augment data checks to include extra parameters (information) Update data checks to support parameterization. Update InvalidTargetDataCheck to validate target based on problem_type Jul 23, 2020
@dsherry dsherry changed the title Update data checks to support parameterization. Update InvalidTargetDataCheck to validate target based on problem_type Support parameterization of data checks; have InvalidTargetDataCheck validate target using problem_type Jul 23, 2020
angela97lin commented

@dsherry How timely! That sounds good to me 😊
