Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor imputer components to remove unnecessary logic #4038

Merged

Conversation

tamargrey
Copy link
Contributor

@tamargrey tamargrey commented Feb 28, 2023

Related to #3999 (Note - will not yet close this issue, as this PR does not remove the nullable type logic; it only refactors the unnecessary logic)

Refactors the following imputers

  • imputer.py
  • simple_imputer.py
  • time_series_imputer.py
  • target_imputer.py
  • knn_imputer.py

@codecov
Copy link

codecov bot commented Feb 28, 2023

Codecov Report

Merging #4038 (68ef36d) into component-specific-nullable-types-handling (80be9be) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

@@                             Coverage Diff                              @@
##           component-specific-nullable-types-handling   #4038     +/-   ##
============================================================================
- Coverage                                        99.7%   99.7%   -0.0%     
============================================================================
  Files                                             349     349             
  Lines                                           37523   37505     -18     
============================================================================
- Hits                                            37405   37387     -18     
  Misses                                            118     118             
Impacted Files Coverage Δ
evalml/pipelines/components/utils.py 96.3% <ø> (-0.2%) ⬇️
.../tests/component_tests/test_time_series_imputer.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 99.2% <ø> (-0.1%) ⬇️
evalml/tests/conftest.py 98.3% <ø> (ø)
evalml/tests/utils_tests/test_gen_utils.py 99.5% <ø> (-<0.1%) ⬇️
...alml/tests/utils_tests/test_nullable_type_utils.py 100.0% <ø> (ø)
evalml/utils/gen_utils.py 99.3% <ø> (-<0.1%) ⬇️
...elines/components/transformers/imputers/imputer.py 100.0% <100.0%> (ø)
...es/components/transformers/imputers/knn_imputer.py 100.0% <100.0%> (ø)
...components/transformers/imputers/simple_imputer.py 100.0% <100.0%> (ø)
... and 10 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

# "and arrays.StringArray are about the same."
features = features.astype(object, copy=False)
features.index = X_ww.index
features.ww.init(logical_types={col_: "categorical" for col_ in features})
Copy link
Contributor Author

@tamargrey tamargrey Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This popped up as a problem in this CI run - by removing is_categorical_actually_boolean usage in the simple imputer, we were no longer recognizing features made from IsFreeEmailDomain as boolean columns and imputing them as such in test_imputer_can_impute_features_generated_from_null_email_url_features.

I think we're good to remove this block of code for two reasons

  1. Freddy added much of that in this PR. We should be good to remove it now that we’re upgraded past sklearn 1.1
  2. The fact that we were converting all columns to categorical logical type at all was originally done because of a change in woodwork inference, but we were also not using woodwork in featuretools at that time (which you can see from the usage of entity in the featuretools calls from that PR), so we no longer have to rely on woodwork inference to get the types of features, which will have woodwork types specified by the primitives used to make them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While sklearn 1.1 did fix the problem Freddy described, I found that removing this block of code entirely caused an issue in Catboost that I was seeing during #3966 with string categories. Converting to object fixed that problem, so I'm leaving that as is and will open a ticket to remove this once that bug is fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ticket to remove this workaround: #4051

if len(X_categorical.columns) > 0:
self._categorical_imputer.fit(X_categorical, y)
self._categorical_cols = X_categorical.columns

X_boolean = X[[col for col in bool_cols if col not in self._all_null_cols]]
X_boolean = X.ww[[col for col in bool_cols if col not in self._all_null_cols]]
Copy link
Contributor Author

@tamargrey tamargrey Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using the imputer in pipelines, we weren't fitting the child imputers on woodwork-initialized data.

I'm surprised this hasn't been noticed before, but I stumbled upon it because of these changes. Now that we're back to returning early if all the columns are the bool dtype in the simple imputer, this was causing an issue in a BooleanNullable column that would get re-inferred as Boolean at fit, so we'd return early and then raise an error that the _component_obj wasn't fit at transform when the correct BooleanNullable type was passed in.

@@ -2737,7 +2744,7 @@ def test_get_component_input_logical_types():
"cat": Categorical(),
"numeric": Integer(),
"EMAIL_ADDRESS_TO_DOMAIN(email)": Categorical(),
"IS_FREE_EMAIL_DOMAIN(email)": Categorical(),
"IS_FREE_EMAIL_DOMAIN(email)": BooleanNullable(),
Copy link
Contributor Author

@tamargrey tamargrey Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change to not automatically convert all features to Categorical will result in BooleanNullable columns being introduced into X via the EmailFeaturizer after the ReplaceNullableTypes transformer has been used.

Because of that, I feel like this change should wait to be introduced with the rest of the nullable type handling. 😢

@tamargrey tamargrey changed the base branch from main to component-specific-nullable-types-handling March 1, 2023 21:42
@tamargrey tamargrey force-pushed the imputer-refactor branch 2 times, most recently from a4d1335 to ee47192 Compare March 1, 2023 21:47
[
col
for col in X.ww.select("Categorical")
if is_categorical_actually_boolean(X, col)
Copy link
Contributor Author

@tamargrey tamargrey Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation as to why we can remove is_categorical_actually_boolean:

  • We had "categorical booleans" because of the initial ReplaceNulalbleTypes transformer implementation that turned all BooleanNullable columns int Categorical.
  • This util originally added in Sklearn 1.1.1 Upgrade #3525 when we first started imputing boolean columns separately from categorical columns so we needed a way to separate them out.
  • We no longer create these categorical booleans (I caught and removed the last case where we were doing it in the EmailFeaturizer), and I don't think we need to handle the case where a user has created them, so uses of this should be removed.

if is_categorical_actually_boolean(X, col)
],
)
X = set_boolean_columns_to_integer(X)
Copy link
Contributor Author

@tamargrey tamargrey Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation as to why we can stop converting boolean columns to integers:

  • This originally began as converting booleans to categorical at fit inside the simple imputer in Pandas 1.4.0 Support #3324. Not 100% clear why this was needed, but I imagine it had to do with a pandas incompatibility with booleans.
  • Then, we started converting all boolean columns to categorical at transform as well in Make EvalML compatible with the new Woodwork Boolean inference #3892 (which I think was done because we couldn't reinfer the columns as booleans when they were imputed as booleans because sklearn turns them into 1s and 0s)
  • Eventually, turning bools to categories became converting to IntegerNullable in Fix BooleanNullable SimpleImputer bug #3959 to handle a pandas bug with imputing categorical booleans.
  • Given that the sklearn simple imputer's behavior is to turn boolean columns to 1s and 0s when imputed, we've now essentially recreated that behavior outside of it, so we should be able to just pass the original boolean and boolean nullable columns into the sklearn imputer and handle turning them back to Booleans at the end like we are now.
  • These changes also mean that we don't need to keep track of boolean columns and can stop reconverting columns to bool after imputing

Copy link
Contributor Author

@tamargrey tamargrey Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly worried that part of the original reason for converting boolean columns was because sklearn will not fit or transform its imputer on data that is only the bool dtype. Converting to category or int64 doesnt solves that problem because sklearn will happily take in data of only those types.

At one point we had the logic that I now added back in to return early and not fit the imputer on sklearn's imputer object if only bool columns are present, and I havent been able to confirm it, but it's possible that part of the reason we removed that logic is that it's problematic if nans get introduced in the test data - we would have not fit the component, causing errors.

Because of that, I added logic to store the information if the component was fit on fully bool data, and if the data at transform isn't fully bool, call fit on the component_obj, which I know isn't ideal to be imputing the target based on only test data.

To me this felt like enough of an edge case that this handling was okay given that it's the simplest solution, but I am open to other handlings (converting datasets of only bool columns to boolean nullable? just not handling this case at all?) if others feel strongly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we have to handle this edge case if I understood correctly:
Screenshot 2023-03-02 at 2 13 26 PM

"Given that the sklearn simple imputer's behavior is to turn boolean columns to 1s and 0s when imputed, we've now essentially recreated that behavior outside of it, so we should be able to just pass the original boolean and boolean nullable columns into the sklearn imputer and handle turning them back to Booleans at the end like we are now."

This is only true when the impute strategy is mean. Above we set it to most_frequent and we error out. This is the case we need since we only use most_frequent with bool columns in Imputer.

Could we convert the column to a string categorical or keep the current behavior of converting to a integer just for the most_frequent strategy or your suggestion above? I'd still prefer the conversion over refitting and transforming on the component object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with Jeremy here - calling fit during transform opens us up to future bugs. We only call fit once, and only ever call it on the training data. With a call to _component_obj.fit during transform, we now allow the component to fit on validation/testing data and overwrite earlier component training.
A quick test shows what would happen in the current setup:
Screen Shot 2023-03-03 at 9 23 37 AM

We will be better served converting the booleans to integers rather than categoricals for most_frequent to match the behavior for mean, and agreed that conversion is preferable to this refitting.

Copy link
Contributor Author

@tamargrey tamargrey Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you both for keeping me honest here! I definitely missed the nuance that sklearn allows bool only data for "mean" and "median" strategies. I think there are two main questions to answer:

  1. How should we handle "mean" and "median" strategies for the SimpleImputer? We only allow knn or most_frequent as strategies for boolean columns in the Imputer component, but we don't restrict that in our SimpleImputer component despite the fact that we actually get a woodwork TypeConversionError at init if you attempt to use median or mean with a boolean nullable column that contains nans and the result contains not whole numbers (the cases becca's examples show - this is on main and my branch). For now, I think we should ignore this behavior since fixing it should be given proper consideration and it's not reachable from automl. I'll make an issue.
  2. How should we handle the case when only bool dtypes are passed to the SimpleImputer (independent of impute strategy which only really works for bool cols when its most_frequent)? We need to find a way to fit the _component_obj from our fit method in this case instead of returning early, and I think we should keep the changes as minimal as possible, and only convert those columns to BooleanNullable (which sklearn is fine with!) when there are only bool columns in the data. There's no need to convert other bool columns, and I think converting to string or numeric values just opens us up to unexpected behavior now that sklearn has so much support for the boolean dtype. I'll double check with all of the different types that there isn't any unexpected behavior, or course.

Really, I just wish sklearn didn't have this weird limitation. Seems to completely miss the exact use case we need it for which is different training and test data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created issue for the TypeConversionError: #4050

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good for 1. and agree with the BooleanNullable transform as long we set it back to bool after imputation!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out on making sure it goes back to Boolean after. I thiiink that's what's happening but I'll confirm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, we're only converting to boolean dtype in fit; we just then return early in transform if all the columns are bool , so the types will stay the same!

@tamargrey
Copy link
Contributor Author

I believe the codecov failures are just because enough code was removed that coverage has lowered, but if you go into the codecov report, none of the changes show uncovered lines.

)
new_logical_type = y_ww.ww.logical_type
if isinstance(y_ww.ww.logical_type, IntegerNullable):
new_logical_type = Double
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting imputed IntegerNullable data to Integer truncates the imputed values for mean and median, so I changed this to Double, which matches other imputers' behavior.

if isinstance(y_ww.ww.logical_type, IntegerNullable):
new_logical_type = Double
elif isinstance(y_ww.ww.logical_type, BooleanNullable):
new_logical_type = Boolean
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this for consistency's sake

@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch from 216fac9 to 494c4a5 Compare March 2, 2023 16:01
)

# Only impute data that is not natural language columns or fully null
self._cols_to_impute = [
Copy link
Contributor Author

@tamargrey tamargrey Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change to use a _cols_to_impute variable to manage the fully null and natural language column logic here and at the simple imputer was a change that, to me, makes the imputer logic easier to understand. It makes the handling of different types of data more explicit rather than a sequence of changes to X to keep track of. It should also have slightly more efficient woodwork usage.

One implication of this is that we never even pass fully null columns to the imputer _component_obj, which could be problematic for test data that may not be fully null, but in that case, we'd have no information from the training data to use to impute data, and in the previous call to pd.DataFrame(X_t, columns=not_all_null_or_natural_language_cols), it assumes the fully null columns wouldn't be present anyway, so I'm assuming that it's okay to not pass the fully null columns in at all.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be ok as long as we're not passing the fully null or natural language column as input for both fit and transform. It is also ok to pass the full null columns in - I believe the sklearn imputer's drop them automatically.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2023-03-02 at 2 05 39 PM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep! It simplified things a bit to not pass them in at either fit or transform, but agreed it's something we need to be consistent on.

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than the all bool logic for the simple imputer. Lmk what you think!

)

# Only impute data that is not natural language columns or fully null
self._cols_to_impute = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be ok as long as we're not passing the fully null or natural language column as input for both fit and transform. It is also ok to pass the full null columns in - I believe the sklearn imputer's drop them automatically.

)

# Only impute data that is not natural language columns or fully null
self._cols_to_impute = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2023-03-02 at 2 05 39 PM

if is_categorical_actually_boolean(X, col)
],
)
X = set_boolean_columns_to_integer(X)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we have to handle this edge case if I understood correctly:
Screenshot 2023-03-02 at 2 13 26 PM

"Given that the sklearn simple imputer's behavior is to turn boolean columns to 1s and 0s when imputed, we've now essentially recreated that behavior outside of it, so we should be able to just pass the original boolean and boolean nullable columns into the sklearn imputer and handle turning them back to Booleans at the end like we are now."

This is only true when the impute strategy is mean. Above we set it to most_frequent and we error out. This is the case we need since we only use most_frequent with bool columns in Imputer.

Could we convert the column to a string categorical or keep the current behavior of converting to a integer just for the most_frequent strategy or your suggestion above? I'd still prefer the conversion over refitting and transforming on the component object.

eccabay
eccabay previously requested changes Mar 3, 2023
Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to Jeremy's feedback, I think we need to find a better way to handle booleans in the simple imputer.

Other than that though, it's mostly just nitpicks. I really appreciate this work!

docs/source/release_notes.rst Outdated Show resolved Hide resolved
if is_categorical_actually_boolean(X, col)
],
)
X = set_boolean_columns_to_integer(X)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with Jeremy here - calling fit during transform opens us up to future bugs. We only call fit once, and only ever call it on the training data. With a call to _component_obj.fit during transform, we now allow the component to fit on validation/testing data and overwrite earlier component training.
A quick test shows what would happen in the current setup:
Screen Shot 2023-03-03 at 9 23 37 AM

We will be better served converting the booleans to integers rather than categoricals for most_frequent to match the behavior for mean, and agreed that conversion is preferable to this refitting.

Comment on lines +143 to +142
# reorder columns to match original
X_t = X_t.ww[[col for col in original_schema.columns if col in X_t.columns]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏 this will prevent future mystery bugs, thank you!

imp = SimpleImputer()
imp.fit(X_train)

# X_test will be BooleanNullable which will be a problem when _component_obj isn't fit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused what this comment is saying here, could you clarify?

You also explicitly cast the woodwork types to be BooleanNullable, but they're already automatically inferred to be. This isn't a bad thing, it protects us against future changes in woodwork inference, but I'm a bit confused with the context of the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just the follow-up to the first comment, explaining the test case where we start with Boolean types in X_train and then have BooleanNullable types with nans in X_test. Once I switch to convert boolean types instead of calling fit inside of transform, this test will stay the same, though maybe I'll add a check that the actual imputed values are the same as the values that would have been imputed if the types were BooleanNullable all along

evalml/tests/component_tests/test_target_imputer.py Outdated Show resolved Hide resolved
@@ -199,7 +199,6 @@ def test_downcast_nullable_y_replaces_nullable_types(
nullable_ltype,
has_nans,
):
# --> remove this comment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😂

@eccabay eccabay dismissed their stale review March 3, 2023 22:45

Dismissed so I don't block your progress when I'm OOO next week!

@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch from 494c4a5 to 80be9be Compare March 6, 2023 15:44
if is_categorical_actually_boolean(X, col)
],
)
X = set_boolean_columns_to_integer(X)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good for 1. and agree with the BooleanNullable transform as long we set it back to bool after imputation!

@jeremyliweishih
Copy link
Collaborator

LGTM! great work 😄

Comment on lines 673 to 684
X_train = pd.DataFrame(
{
"bools1": pd.Series([True, False, True, True] * 20),
"bools2": pd.Series([True, False, True, False] * 20),
},
)
X_train.ww.init(
logical_types={
"bools1": "Boolean",
"bools2": "Boolean",
},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth adding the all bools columns to the imputer_test_data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably!

I can add in a second all boolean no nulls column into the fixture that way I can get my "all bools" dataframe with a select("Boolean") call and then add nans in after fit to create the test data for this test. I can't think of anything that would be inherently problematic about adding a second Boolean column other than maybe some tests that only expect there to be one might fail.

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks solid to me. I just had the one question about roping the small ad-hoc dataframes into the larger fixture, but that's about it.

@tamargrey tamargrey merged commit 7d25090 into component-specific-nullable-types-handling Mar 8, 2023
@tamargrey tamargrey deleted the imputer-refactor branch March 8, 2023 20:19
tamargrey added a commit that referenced this pull request Mar 9, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 15, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 15, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 15, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 16, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 17, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 20, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 20, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 22, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 24, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 27, 2023
* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture
tamargrey added a commit that referenced this pull request Mar 29, 2023
* initial commit to make the PR

* Refactor imputer components to remove unnecessary logic  (#4038)

* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture

* Use nullable type handling in components' fit, transform, and predict methods (#4046)

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add handle call for lgbm regressor and remove existing handling

* Add handle call for lgbm classifier

* temp broken exp smoothing tests

* lint fix

* add release note

* Fix broken tests by initting woodwork on y in lgbm classifier

* Update tests

* Call handle in arima

* call handle from ts imputer y ltype is downcasted value

* remove unnecessary comments

* Fix time series guide

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Add y ww init to ts imputer back in to fix tests

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* use X_d in lgbm predict proba

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* use common util to determine type for time series imputed integers

* Add comments around why we copy X

* remove _prepare_data from samplers

* PR comments

* remove tests to check if handle method is called

* remove nullable types from imputed data because of regularizer

* fix typo

* fix docstrings

* fix codecov issues

* PR comments

* Revert "Fix time series guide"

This reverts commit 964622a.

* return unchanged ltype in nullabl;e type utils

* add back ts imputer incompatibility test

* use dict get return value

* call handle nullable types in oversampler and check schema equality

* Remove other nullable handling across AutoMLSearch (#4085)

* Update tests

* remove unnecessary comments

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* Handle new oversampler nullable type incompatibility and add tests

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add comments to self

* Add handle call for lgbm classifier

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* Remove downcast call from imputer.py and fix tests

* remove downcast call from catboost regressor

* stop adding ohe if bool null present

* remove nullable type handling from knn imputer

* remove nullable type handling from targer imputer

* use util for ltype deciding in simple imputer

* Fix broken target imputer test

* Remove replace nullable types transformer from automl search and fix tests

* Handle nullable types in lgbm classifier predict_proba

* fix after rebase

* move util to imputer utils and other fixes

* remove duplicate util

* Change how imputer reinits woodwork types

* Expand tests

* use automl test env

* fix broken test

* Clean up

* Add release note

* Further clean up and note incompatibility

* Continue clean up

* fix failing docstring

* remove remaining comments

* fix invali target data check docstring

* let knn and simple imputers use same flow of logic for readability

* Move _get_new_logical_types_for_imputed_data to nullable type utils file

* PR comment

* Handle decomposer incompatibility (#4105)

* Handle decomposer incompatibility

* fix comment

* Stop scaling the values up

* Add release note

* PR Comments

* Create incompatibility for testing more similarly to how it shows up in the decomposer

* Add release note

* lint fix

* pr notes

* fix after rebase

---------

Co-authored-by: chukarsten <64713315+chukarsten@users.noreply.github.com>
@chukarsten chukarsten mentioned this pull request Apr 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove Nullable type logic from Imputer Components and Refactor
4 participants