Change _schema_is_equal to allow nullable and non-nullable types to be used interchangeably between train and test data #4077

Open
tamargrey opened this issue Mar 14, 2023 · 0 comments

As a user, I wish I could train a pipeline on data that might not have NaNs and so has non-nullable types, and then predict/transform_all_but_final/score on data that has NaNs and therefore has nullable types.

Currently, having nullable types at train and non-nullable types at test (or vice versa) causes ComponentGraph._transform_features to raise the error "Input X data types are different from the input types the pipeline was fitted on". However, apart from whether they may contain null values, nullable types and their non-nullable counterparts contain the same type of data.

Once the nullability epic is in place, we may see increased usage of nullable types, which could result in more instances of the above situation popping up.

I propose we change _schema_is_equal to treat the following nullable types interchangeably with their non-nullable counterparts:

  • Integer - IntegerNullable
  • Boolean - BooleanNullable
  • Age - AgeNullable
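
The proposed equivalence could be sketched roughly as follows. This is a minimal illustration, not the actual _schema_is_equal implementation: the helper names (normalize_logical_type, schemas_equivalent) are hypothetical, schemas are modeled as plain dicts mapping column names to logical type names, and requiring the same column order is an assumption of this sketch.

```python
# Hypothetical mapping from each nullable logical type name to its
# non-nullable counterpart, covering only the three pairs listed above.
NULLABLE_TO_BASE = {
    "IntegerNullable": "Integer",
    "BooleanNullable": "Boolean",
    "AgeNullable": "Age",
}


def normalize_logical_type(type_name):
    """Collapse a nullable logical type name to its non-nullable counterpart."""
    return NULLABLE_TO_BASE.get(type_name, type_name)


def schemas_equivalent(fitted_types, input_types):
    """Compare two {column name: logical type name} dicts, ignoring nullability."""
    # Assumption of this sketch: columns must match in name and order.
    if list(fitted_types) != list(input_types):
        return False
    return all(
        normalize_logical_type(fitted_types[col])
        == normalize_logical_type(input_types[col])
        for col in fitted_types
    )


# Train data without NaNs (non-nullable types) vs. test data with NaNs.
train = {"age": "Integer", "is_member": "Boolean", "income": "Double"}
test = {"age": "IntegerNullable", "is_member": "BooleanNullable", "income": "Double"}

print(schemas_equivalent(train, test))
print(schemas_equivalent(train, {**test, "income": "Categorical"}))
```

Normalizing both sides before comparing keeps the check symmetric, so it makes no difference whether the nullable type shows up at train time or at test time.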

There are several things to take into account when implementing this:

  • Overall, think about the impact of allowing these types to be used interchangeably
  • Consider requiring that we validate the existence of NaNs before treating nullable and non-nullable types as equivalent. In general, I don't want us to shy away from keeping IntegerNullable columns as such even if no NaNs are present (whether we impute them ourselves or users input them). Those types aren't meant to imply the presence of NaNs, only that the type can support null values; otherwise they're the same as non-nullable integers. For example, in Featuretools, we might output types as IntegerNullable from a Primitive so that users could pass NaNs in without it breaking.
  • Increase test coverage of the ways data can differ between train and test data, across the different problem types and the score, predict, and transform_all_but_final methods on pipelines
  • Confirm this doesn't change automl results before or after nullable type handling changes
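
For the test coverage item above, one way to enumerate the cases is to cross the type pairs with the pipeline methods in both train/test directions. The sketch below is only an illustration of that matrix: build_cases and the dict layout are hypothetical, and problem types are left out because this issue does not list them (they could be added as another axis).

```python
from itertools import product

# The three type pairs proposed in this issue.
TYPE_PAIRS = [
    ("Integer", "IntegerNullable"),
    ("Boolean", "BooleanNullable"),
    ("Age", "AgeNullable"),
]

# The pipeline methods named in this issue.
METHODS = ["predict", "score", "transform_all_but_final"]


def build_cases():
    """Return one case per (type pair, method, direction) combination."""
    cases = []
    for (base, nullable), method in product(TYPE_PAIRS, METHODS):
        # Non-nullable at train, nullable at test.
        cases.append({"train_type": base, "test_type": nullable, "method": method})
        # Nullable at train, non-nullable at test.
        cases.append({"train_type": nullable, "test_type": base, "method": method})
    return cases


cases = build_cases()
print(len(cases))  # 3 pairs x 3 methods x 2 directions = 18
```

A list like this could be fed to pytest.mark.parametrize so that each combination runs as its own test.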