Drop BooleanNullable for replace and reinit WW #3678
Codecov Report
@@ Coverage Diff @@
## main #3678 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 337 337
Lines 34018 34067 +49
=======================================
+ Hits 33887 33936 +49
Misses 131 131
@@ -162,6 +162,7 @@ def transform(self, X, y=None):
        X_numeric = X.ww[self._numeric_cols.tolist()]
        imputed = self._numeric_imputer.transform(X_numeric)
        X_no_all_null[X_numeric.columns] = imputed
        X_no_all_null = downcast_integer_nullable_to_double(X_no_all_null)
Not too sure about this @chukarsten, but I had to add this for imputer test cases that had `IntegerNullable` and `BooleanNullable`. This case wouldn't show up in AutoML since we're replacing `IntegerNullable` using the `ReplaceNullableTypes` transformer. I can make the corresponding changes to `make_pipeline` so that we handle both `IntegerNullable` and `BooleanNullable` the same way.
For more info: the error I was getting was a logical type mismatch with the IntegerNullable columns when calling `downcast_boolean_nullable_to_double()`. Not too sure about the reasoning, but my guess was that there was a mismatch between the IntegerNullable columns and the imputed numeric columns.
I think it makes sense to treat them the same way. If we did, could we set the imputed integer columns to Integer?
@fjlanasa sounds good. It has to be a float because we can impute by the `mean` of a column. However, I could set it to `Integer` when we use "median", "most_frequent", or "constant", but I wanted the type to be consistent coming out of the imputer. Do you think I should change it to `Integer` based on the imputation strategy?
Ah ok. Whatever you think makes sense.
I think for "most frequent" strategies, it makes sense to retain the non-null type of the most frequent value being imputed. For "median", depending on how many items are in the feature, the median of a nullable integer column could be an integer or a float, so I'd think in that case we'd want to go float. For "constant", we're in a similar situation: the user could supply a float or an integer in numeric cases. The constant's dtype should probably determine the output dtype.
Ultimately, I think that integer is going to be a thing of the past. numpy is working on the Float64 dtype. I think the future for numeric imputation just involves working with nullable floats and flushing integers down the toilet. Along those lines, that makes me feel like we want to move everything towards floats after imputation. The future of imputation could look like numeric data flowing through our pipelines as nullable floats, both before AND after the imputer.
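A minimal sketch of the strategy discussion above, using plain Python lists with `None` standing in for a nullable integer column. `impute_numeric` and `output_type` are hypothetical helpers written for illustration only; they are not EvalML's imputer, and the "Integer"/"Double" strings merely echo the logical type names used in this thread.

```python
import statistics


def impute_numeric(values, strategy, fill_value=None):
    """Fill None entries in a list of ints (hypothetical helper, not EvalML code)."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)    # always returns a float
    elif strategy == "median":
        fill = statistics.median(observed)   # middle value, or a float average for even counts
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)     # always one of the observed values
    elif strategy == "constant":
        fill = fill_value                    # whatever type the caller supplied
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def output_type(filled):
    """Name the narrowest type that holds every value after imputation."""
    return "Integer" if all(isinstance(v, int) for v in filled) else "Double"


column = [1, 2, None, 4, 4, 8]
for strategy in ("mean", "median", "most_frequent"):
    print(strategy, output_type(impute_numeric(column, strategy)))
```

With an even number of observed values the median is an average of the two middle values, so the result type depends on the data. Casting everything to a float after imputation, as the PR does, avoids that data-dependent behavior.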
    return X


def downcast_boolean_nullable_to_double(X):
shoutout to @ParthivNaresh for these two methods 😄
These are both the same function but with different values for source and destination logical types, yes? Perhaps we could write it that way and just call the generic downcast with arguments.
Also, does this function definitely do what we want it to? If I have an IntegerNullable column and init as Double, then I get floats and numpy.nan? I guess BooleanNullable -> Double is similar, yes? Is it just treated as 2 integers that get cast to float with numpy.nan?
I think we can rename this to `downcast_boolean_nullable_to_boolean` 😄
Also update the description :)
@chukarsten the column should have already been imputed (the NaN values shouldn't exist), so these functions are just updating the columns' types to match the imputed values. It would be from `IntegerNullable` to `Integer` or `Double`, and from `BooleanNullable` to `Boolean`.
Duh. Thanks.
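For reference, a rough sketch of the "generic downcast with arguments" idea from this thread. It assumes pandas extension dtypes (`Int64`, `boolean`) as stand-ins for Woodwork's `IntegerNullable` and `BooleanNullable` logical types; `downcast_nullable` is a hypothetical name, not the function added in this PR.

```python
import pandas as pd


def downcast_nullable(X, source_dtype, target_dtype):
    """Cast every column whose pandas dtype is `source_dtype` to `target_dtype`.

    Hypothetical generic helper; the PR's functions operate on Woodwork
    logical types rather than raw pandas dtypes.
    """
    X = X.copy()
    for col in X.columns:
        if str(X[col].dtype) == source_dtype:
            X[col] = X[col].astype(target_dtype)
    return X


# Columns that have already been imputed, so no missing values remain.
X = pd.DataFrame(
    {
        "count": pd.array([1, 2, 3], dtype="Int64"),
        "flag": pd.array([True, False, True], dtype="boolean"),
    }
)
X = downcast_nullable(X, "Int64", "float64")
X = downcast_nullable(X, "boolean", "bool")
print(X.dtypes)

# If a missing value were still present, a cast to float would turn it into NaN.
with_na = pd.Series([True, None, False], dtype="boolean").astype("float64")
print(with_na.tolist())
```

The last two lines address the question about missing values: in pandas, booleans cast to float become 1.0 and 0.0, and a remaining missing entry becomes NaN. Casting a `boolean` column that still contains a missing value straight to `bool` raises an error instead, which is why the cast only makes sense after imputation.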
Still need to add a test case for the impl but going to open it up for review.
…into js_test_nullable_boolean
When this satisfies @dvreed77 , it satisfies me!
Looks good to me! Just left a couple nits/clarifying questions.
evalml/pipelines/components/transformers/preprocessing/replace_nullable_types.py
types_replace_null_handles = [
    logical_types.AgeNullable,
    logical_types.Boolean,
    logical_types.BooleanNullable,
    logical_types.Double,
    logical_types.Integer,
    logical_types.IntegerNullable,
]
Does this mean we're going to have a `Replace Nullable Types Transformer` in almost every pipeline moving forward, even if there are no nulls or nullable types in the input data? Could you explain a little more about why that is? I'm rather uninformed w/r/t all the nullable types work.
Yes, that's the plan. We're doing this to make sure that nullables are handled correctly either by `ReplaceNullableTypes` (when there aren't NaN values) or by the imputer (when there are NaN values). In the future, the nullable types should be handled at the component or estimator level.
Sorry, I'm still missing something here. Are `logical_types.Boolean/Double/Integer` nullable? Or is this to cover the case where there are no nullable types in the training data but there are in the test data, similar to the reason we backed out conditionally including the imputer?
@eccabay yup the latter is what I was going for!
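A toy sketch of that train/test case, assuming a pipeline is just a list of component names and each column carries a logical type name as a string. `plan_components` and `unhandled_nullable_columns` are hypothetical and do not mirror `make_pipeline`; they only illustrate why the transformer is included unconditionally.

```python
NULLABLE_TYPES = {"AgeNullable", "BooleanNullable", "IntegerNullable"}
REPLACER = "Replace Nullable Types Transformer"


def plan_components(train_types, always_include):
    """Choose components from the training data's column types (toy model)."""
    components = []
    has_nullable = any(t in NULLABLE_TYPES for t in train_types.values())
    if always_include or has_nullable:
        components.append(REPLACER)
    components.append("Estimator")
    return components


def unhandled_nullable_columns(components, data_types):
    """List nullable columns that would reach the estimator untouched."""
    if REPLACER in components:
        return []
    return sorted(c for c, t in data_types.items() if t in NULLABLE_TYPES)


train_types = {"age": "Integer", "active": "Boolean"}
test_types = {"age": "IntegerNullable", "active": "Boolean"}

conditional = plan_components(train_types, always_include=False)
unconditional = plan_components(train_types, always_include=True)
print(unhandled_nullable_columns(conditional, test_types))
print(unhandled_nullable_columns(unconditional, test_types))
```

Under these assumptions, the conditionally built pipeline has no step for the `age` column once it arrives as `IntegerNullable` at predict time, while the unconditional one does.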
This PR adds `BooleanNullable` and `IntegerNullable` support in `imputer.py` by converting the column to `Boolean` or `Float` respectively after imputing. Because of this, we can drop the `ReplaceNullableTypes` transformer when the input contains `BooleanNullable`.