Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop BooleanNullable for replace and reinit WW #3678

Merged
merged 18 commits into from
Aug 29, 2022

Conversation

jeremyliweishih
Copy link
Collaborator

@jeremyliweishih jeremyliweishih commented Aug 22, 2022

This PR adds BooleanNullable and IntegerNullable support in imputer.py by converting the column to Boolean or Float respectively after imputing. Because of this - we can drop the ReplaceNullableTypes transformer when input contains BooleanNullable.

@codecov
Copy link

codecov bot commented Aug 22, 2022

Codecov Report

Merging #3678 (a80df42) into main (59fb0ff) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3678     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        337     337             
  Lines      34018   34067     +49     
=======================================
+ Hits       33887   33936     +49     
  Misses       131     131             
Impacted Files Coverage Δ
...valml/tests/automl_tests/test_default_algorithm.py 100.0% <ø> (ø)
evalml/utils/__init__.py 100.0% <ø> (ø)
...elines/components/transformers/imputers/imputer.py 100.0% <100.0%> (ø)
...ansformers/preprocessing/replace_nullable_types.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.5% <100.0%> (+0.1%) ⬆️
evalml/tests/component_tests/test_imputer.py 100.0% <100.0%> (ø)
...ts/component_tests/test_nullable_types_replacer.py 100.0% <100.0%> (ø)
...lml/tests/integration_tests/test_nullable_types.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.8% <100.0%> (ø)
evalml/utils/woodwork_utils.py 100.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@jeremyliweishih jeremyliweishih marked this pull request as ready for review August 22, 2022 20:18
@@ -162,6 +162,7 @@ def transform(self, X, y=None):
X_numeric = X.ww[self._numeric_cols.tolist()]
imputed = self._numeric_imputer.transform(X_numeric)
X_no_all_null[X_numeric.columns] = imputed
X_no_all_null = downcast_integer_nullable_to_double(X_no_all_null)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too sure about this @chukarsten but I had to add this to for imputer test cases that had IntegerNullable and BooleanNullable. This case wouldn't show up in AutoML since we're replacing IntegerNullable using the ReplaceNullableTypes transformer. I can make the corresponding changes to make_pipeline so that we handle both IntegerNullable and BooleanNullable the same way.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For more info: the error I was getting was that there was a logical type mismatch with the IntegerNullable columns when calling downcast_boolean_nullable_to_double(). Not too sure about the reasoning but my guess was that there was a mismatch in the IntegerNullable and the imputed numeric columns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to treat them the same way. If we did, could we set the imputed integer columns to Integer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fjlanasa sounds good. It has to be a float because we can impute by the mean of a column. However, I could set it to Integer when we use "median", "most_frequent", "constant" but I wanted the type to be consistent coming out of the imputer. Do you think I should change it to Integer based off of the imputation strategy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. Whatever you think makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for "most frequent" strategies, it makes sense to retain the non-null type of the most frequent thing that is being imputed. For "median," depending on how many items are in the feature, a nullable integer could be an integer or a float, so I'd think in that case we'd want to go float. For "constant", we're kind of in a similar situation - the user could supply a float or an integer in numeric cases. Whatever the constant's dtype should probably determine the output dtype.

Ultimately, I think that integer is going to be a thing of the past. numpy is working on the Float64 dtype. I think the future for numeric imputation just involves working with nullable floats and flushing integers down the toilet. Along those lines, that makes me feel like we want to move everything towards floats after imputation. The future of imputation could look like numeric data flowing through our pipelines as nullable floats, both before AND after the imputer.

return X


def downcast_boolean_nullable_to_double(X):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shoutout to @ParthivNaresh for these two methods 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are both the same function but with different values for source and destination logical types, yes? Perhaps we could write it that way and just call the generic downcast with arguments.

Also, this function definitely does what we want it to? If I have an IntegerNullable column and init as Double, then I get floats and numpy.nan? I guess BooleanNullable -> Double is similar, yes? Is it just treated as 2 integer that get cast to float with numpy.nan?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can replace the name to downcast_boolean_nullable_to_boolean 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also update the description :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chukarsten the column should have already been imputed (the NaN values shouldn't exist) so these functions are just updated the columns to match the imputed values. It would be from IntegerNullable ---> Integer or Double and BooleanNullable to Boolean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duh. Thanks.

@jeremyliweishih
Copy link
Collaborator Author

Still need to add a test case for the impl but going to open it up for review.

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this satisfies @dvreed77 , it satisfies me!

Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just left a couple nits/clarifying questions.

Comment on lines +86 to +93
types_replace_null_handles = [
logical_types.AgeNullable,
logical_types.Boolean,
logical_types.BooleanNullable,
logical_types.Double,
logical_types.Integer,
logical_types.IntegerNullable,
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we're going to have a Replace Nullable Types Transformer in almost every pipeline moving forward, even if there are no nulls or nullable types in the input data? Could you explain a little more about why that is? I'm rather uninformed w/r/t all the nullable types work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's the plan. We're doing this to make sure that nullables are handled correctly either by ReplaceNullableTypes (when there isn't nan values) or by the imputer (when there are nan values). In the future the nullable types should be handle on the component or estimator level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm still missing something here. Are logical_types.Boolean/Double/Integer nullable? Or is this to cover the case where there's no nullable types in the training data but there is in the test, similar to the reason we backed out conditionally including the imputer.

Copy link
Collaborator Author

@jeremyliweishih jeremyliweishih Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay yup the latter is what I was going for!

@jeremyliweishih jeremyliweishih merged commit 6f82c82 into main Aug 29, 2022
@jeremyliweishih jeremyliweishih deleted the js_test_nullable_boolean branch August 29, 2022 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants