
Use nullable type handling in components' fit, transform, and predict methods #4046

Merged

Conversation

@tamargrey tamargrey commented Mar 2, 2023

Closes #3993, closes #4001, closes #3924, closes #3974, closes #4015, closes #3926.

This PR doesn't handle the objective functions; we're hoping the next sklearn release lands in time so we don't have to handle those ourselves 🤞.

codecov bot commented Mar 2, 2023

Codecov Report

Merging #4046 (6a10435) into component-specific-nullable-types-handling (da6e3da) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@                             Coverage Diff                              @@
##           component-specific-nullable-types-handling   #4046     +/-   ##
============================================================================
+ Coverage                                        99.7%   99.7%   +0.1%     
============================================================================
  Files                                             349     349             
  Lines                                           37532   37579     +47     
============================================================================
+ Hits                                            37413   37461     +48     
+ Misses                                            119     118      -1     
Impacted Files Coverage Δ
...ents/estimators/classifiers/lightgbm_classifier.py 100.0% <100.0%> (ø)
...omponents/estimators/regressors/arima_regressor.py 100.0% <100.0%> (ø)
...tors/regressors/exponential_smoothing_regressor.py 100.0% <100.0%> (ø)
...onents/estimators/regressors/lightgbm_regressor.py 100.0% <100.0%> (ø)
...nents/transformers/imputers/time_series_imputer.py 100.0% <100.0%> (ø)
...s/components/transformers/samplers/base_sampler.py 90.0% <100.0%> (+1.3%) ⬆️
...es/components/transformers/samplers/oversampler.py 100.0% <100.0%> (ø)
...s/components/transformers/samplers/undersampler.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_arima_regressor.py 100.0% <100.0%> (ø)
...nent_tests/test_exponential_smoothing_regressor.py 100.0% <100.0%> (ø)
... and 6 more

... and 1 file with indirect coverage changes


@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch from 494c4a5 to 80be9be Compare March 6, 2023 15:44
X_schema = X_schema.get_subset_schema(
    subset_cols=X_schema._filter_cols(
        exclude=["IntegerNullable", "BooleanNullable", "AgeNullable"],
    ),
)
@tamargrey tamargrey commented Mar 7, 2023:

This change means we'll no longer rely on woodwork inference for the nullable types after imputing. Instead, we pass these logical types along even after imputation.

This is something I've thought a lot about, since it isn't immediately obvious why we wouldn't either continue to rely on type inference or specify the non-nullable types that we know we can convert to after inference.

  • Don't rely on inference - type inference takes time, and it leaves the types up to woodwork inference, which, as we all know, can change at any time. Not leaving these types up to inference makes this component more efficient and more consistent.
  • Keep the nullable types even after imputing - the goal of this epic is to be able to pass in nullable types wherever and whenever. We shouldn't need to avoid them anymore, so I don't want to add the extra complexity of determining what types to use unless it proves necessary. That might happen if our dependencies show worse runtimes when we keep using nullable types, hurting automl search times, or if we ever make optimizations that depend on whether columns' types can contain nans (e.g. never try to impute non-nullable data - though that decision lives at the imputer level, which is where this conversation is happening in the first place, so it's a moot point).

I'll end up applying these principles to the other imputers once I get to removing the old nullable type handling from the rest of automl search
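The trade-off described above can be illustrated at the pandas dtype level (a sketch only; the actual component works through woodwork logical types, and these variable names are illustrative):

```python
import pandas as pd

# A column arrives as pandas' nullable Int64 (woodwork's IntegerNullable)
col = pd.Series([1, None, 3], dtype="Int64")

# Impute the missing value. The result can simply keep the Int64 dtype,
# so no inference pass is needed and the declared type stays stable
imputed = col.fillna(2)
assert str(imputed.dtype) == "Int64"

# The rejected alternative: downcast/re-infer now that no nulls remain.
# This costs an extra pass and ties the output type to inference behavior
downcast = imputed.astype("int64")
assert str(downcast.dtype) == "int64"
```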

Reviewer (Contributor):

I think this logic all makes sense, pending my above comment about when we introduce nullability through the regularizer. Have you run performance tests or profiling on any of this, to validate your statement that avoiding inference is faster? From past experience I agree that it would be, but it'd be nice to see some numbers. It would also create a useful benchmark with your comment about dependency speed when using nullable types.

tamargrey (Contributor, Author):

Summarizing a convo from elsewhere: performance testing to this point is showing the expected runtime improvements, with no impact to model results.

# Prior to using test data, make sure the types match the training data
X_test.ww.init(schema=X_train.ww.schema)
tamargrey (Contributor, Author):

This is now needed for the training and test data's woodwork schemas to match, so we don't raise the PipelineError saying the input types differ from the types the data was fitted on.

The training and test data's logical types now differ because of the changes to the time series imputer: it no longer relies on woodwork type inference for IntegerNullable input columns after they're imputed, so instead of re-inferring to Integer to match the test data, we pass along the original IntegerNullable.

Here's a longer explanation of why the change in types isn't worrying me right now and what its implications are:

The training data has nans while the test data doesn't, because the TimeSeriesRegularizer introduces null values into the "Numeric" and "Categorical" columns in the training data, turning the logical type of Numeric from Integer into IntegerNullable, which we then fill in with the time series imputer.

Now that the time series imputer won't convert the type back to Integer and instead keeps the original IntegerNullable type, it's a problem for the component graph, which sees those as different schemas.

In this PR, I don't think it's a bad thing to have to align the train and test data's types. It's what we currently have to do if there are discrepancies between whether nans are present in the train vs. test data. But in the long run, I think we may want to change _schema_is_equal to allow the nullable and non-nullable types to be used interchangeably in the input data to pipelines, which would increase the flexibility of our pipelines.
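A pandas-level analogue of the mismatch (hypothetical frames; evalml's actual check compares woodwork schemas, not raw dtypes):

```python
import pandas as pd

# Training data gained nulls (via the regularizer), so its column is nullable
# Int64; the test data never had nulls and stays plain int64
X_train = pd.DataFrame({"Numeric": pd.array([1, None, 3], dtype="Int64")})
X_test = pd.DataFrame({"Numeric": [4, 5, 6]})

# A strict schema-equality check sees these as different types
assert X_train.dtypes.to_dict() != X_test.dtypes.to_dict()

# Aligning the test data to the training types (the pandas analogue of
# X_test.ww.init(schema=X_train.ww.schema)) resolves the mismatch
X_test = X_test.astype(X_train.dtypes.to_dict())
assert X_train.dtypes.to_dict() == X_test.dtypes.to_dict()
```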

Reviewer (Contributor):

This brings up several interesting points:

  • If an Integer column is passed in but then the TimeSeriesRegularizer adds rows and converts the column type to IntegerNullable, do we want to leave the IntegerNullable type in place? The TimeSeriesRegularizer and TimeSeriesImputer were designed to work hand in hand, and it feels odd to change the data type when we're just fixing the data.
  • I agree with your idea to revisit _schema_is_equal to allow more flexibility with nullable types. I think we'd want to validate the existence of NaNs before ignoring nullability, but we can discuss that in more detail in the future!

I am slightly concerned by the need for this call, especially since the note that it may be needed will be hidden within our docs - if a user encounters this, they probably won't have enough information to figure out what they need to do to make it work.

I'm also confused as to why we don't need this call until this far down in the notebook - we use X_test in many places (including many calls to predict, which call transform_all_but_final). Why do we only encounter this error down here?

Reviewer (Contributor):

It's also interesting to me that you had to update the docs because of this change, but none of our tests broke - seems like a testing gap. This may change depending on how we decide to resolve my above comment, but it would be great to have a test outside of the docs to catch this!

tamargrey (Contributor, Author):

Responding to your points:

  1. I hadn't been thinking about the time series imputer that way (I didn't know that the imputer and regularizer only ever get used together, but it makes sense). You're probably right that if a column was originally Integer and only became IntegerNullable because we inserted nans ourselves, we should restore that original type after we've fixed the data. I don't have a problem with the time series imputer being an exception to the rule that our components should maintain the input type where possible - I'd make sure to have clear comments explaining why it's needed.
  2. In general, I don't want us to shy away from keeping IntegerNullable columns as such even if no nans are present (whether we impute them ourselves or users input them). Those types aren't really meant to imply the presence of nans, just that the type can support null values; other than that, they're the same as non-nullable integers. For example, in Featuretools, we might output IntegerNullable from a primitive so that users could pass in nans without breaking anything. Overall, any changes to _schema_is_equal are something I'll handle in a separate PR to main so we can give them the time they deserve, and we can continue this discussion then.
  3. I think the need for this call is a side effect of the fact that we change X_train, and only X_train, after splitting but prior to automl search in this doc; we were just lucky before that the types matched. Even before these changes, the more a user manipulates only one of the train or test data, the higher the chance that they make the two schemas incompatible. The incompatibility here came from calling evalml components outside of pipelines, and I changed the behavior of one of those components; that isn't really indicative of a larger problem, imo. Using evalml components by themselves and then passing data into search isn't really our expected use case, right? In a pipeline containing the regularizer and then the imputer, with the exact same original X train and test from after we insert nans into the data, we would always have had a problem with the types being different; in this doc, we avoid that problem by fixing the data ourselves with the components. On the other hand, if we had X train and test with no nans, but the regularizer inserted nans into train and not test, the pipeline with the regularizer and imputer wouldn't have a problem. For those cases, I 100% agree we need more tests covering the different variations of train vs. test data you might have. My preference again would be to do that in a PR to main alongside any _schema_is_equal changes we're considering, since this can be considered separately from the nullable type changes, which are only relevant because they expand the types we support as inputs.
  4. This is a very good question. I suspect it has something to do with a slight difference between how ComponentGraph._transform_features gets called from predict vs. in a direct call to transform_all_but_final, but it's not immediately obvious to me, so I'll have to do some digging.

tamargrey (Contributor, Author):

Regarding what's going on to allow predict but not transform_all_but_final:

  • At predict for TimeSeriesPipelineBase, we do indeed call self.transform_all_but_final, but when the training data is present, as it is in our predict calls, we first call self._add_training_data_to_X_Y. Inside _add_training_data_to_X_Y, we concat X_train with X but use the logical types from X_train to set the woodwork types. This means the data we pass into transform_all_but_final will always have the same logical types as the original training data.
  • In a direct transform_all_but_final call, though, we don't pass in the training data, so we pass in the original X_test, which has different types from the training data.

So I think you're right that most predict calls would have raised the same error; we just have slightly different behavior for time series pipelines. I think proper testing of transform_all_but_final, score, and predict with changes in nullable types for all problem types will be an important part of changing the _schema_is_equal logic.
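The difference between the two paths can be sketched in plain pandas (hypothetical helper; the real logic lives in _add_training_data_to_X_Y and works on woodwork schemas):

```python
import pandas as pd

def add_training_data(X_train, X):
    # Sketch of the predict path: concatenate, then force the combined
    # frame back onto the training data's types, so downstream transforms
    # always see the types the pipeline was fitted on
    combined = pd.concat([X_train, X])
    return combined.astype(X_train.dtypes.to_dict())

X_train = pd.DataFrame({"a": pd.array([1, None], dtype="Int64")})
X_test = pd.DataFrame({"a": [3, 4]})  # plain int64, differs from training

# predict-style call: types are normalized to the training types
assert str(add_training_data(X_train, X_test)["a"].dtype) == "Int64"

# direct transform_all_but_final-style call: X_test keeps its own types,
# which is where the schema mismatch surfaced
assert str(X_test["a"].dtype) == "int64"
```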

I'm gonna open an issue to put all of these thoughts into.

tamargrey (Contributor, Author):

opened issue: #4077

Reviewer (Contributor):

Thanks for fixing and filing! Do we still need to reinitialize X_test.ww here, or is that mitigated by your changes to the TimeSeriesImputer?

tamargrey (Contributor, Author):

good call -- removing!

@@ -159,7 +169,7 @@ def test_categorical_and_numeric_input(imputer_test_data):
     y = pd.Series([0, 0, 1, 0, 1])
     imputer = TimeSeriesImputer()
     imputer.fit(X, y)
-    transformed, _ = imputer.transform(X, y)
+    transformed, _ = imputer.transform(X.ww.copy(), y)
tamargrey (Contributor, Author):

This was needed because _handle_nullable_types actually changes the input data, so the second imputer calls were being run without any nullable types.

I want to triple check that _handle_nullable_types mutating the input data isn't problematic. I seem to remember having a conversation in which I learned that evalml as a whole doesn't worry too much about not mutating input data, but I'm still worried this will prove to be problematic at the pipeline level somehow.

tamargrey (Contributor, Author):

The fact that _handle_nullable_types mutates the input data was hiding some incompatibilities in predict/transform methods, because my test_lgbm_with_nullable_types test was passing in an X without nullable types to predict and predict_proba 🙃.

I still can't think of a reason good enough to copy the data every time. I think we just need to be aware of this in tests that are checking that nullable types can be passed into components, and in transformers, make sure we're maintaining the original types in the transformed data.
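A minimal sketch of the mutation issue, with a hypothetical helper standing in for _handle_nullable_types' downcasting:

```python
import pandas as pd

def downcast_nullable_ints(X):
    # Mutates the frame it's handed, like the real downcasting does,
    # rather than working on a copy
    for col in X.columns:
        if str(X[col].dtype) == "Int64":
            X[col] = X[col].astype("float64")
    return X

X = pd.DataFrame({"a": pd.array([1, None], dtype="Int64")})
downcast_nullable_ints(X)
# The caller's frame silently lost its nullable dtype, so a second call
# in a test would no longer exercise the nullable-type path
assert str(X["a"].dtype) == "float64"

# Passing a copy keeps the caller's frame intact
X2 = pd.DataFrame({"a": pd.array([1, None], dtype="Int64")})
downcast_nullable_ints(X2.copy())
assert str(X2["a"].dtype) == "Int64"
```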

Reviewer (Contributor):

This is a great point and edge case - could you add some comments to the tests where you call copy for posterity?

tamargrey (Contributor, Author):

added comments: 8273059

@@ -52,7 +59,7 @@ class TimeSeriesImputer(Transformer):
# Incompatibility: https://github.com/alteryx/evalml/issues/4001
# TODO: Remove when support is added https://github.com/alteryx/evalml/issues/4014
_integer_nullable_incompatibilities = ["X", "y"]
_boolean_nullable_incompatibilities = ["X", "y"]
tamargrey (Contributor, Author):

Realized we could remove X from the boolean nullable incompatibilities because interpolate isn't a valid impute strategy for boolean columns in X. I'm keeping y, though, because we don't stop interpolate from being called on boolean target data - we currently convert to float, so I'm keeping that behavior for now. We should deal with that separately (#4053); it shouldn't need to be part of the nullable types work.

# --> https://github.com/alteryx/evalml/issues/4053
# By using y.ww.logical_type, we'll get the downcast type as the final type, which is expected
# and only problematic for boolean, which will become floats, which is nonsensical
y_imputed = ww.init_series(y_imputed, logical_type=y.ww.logical_type)
tamargrey (Contributor, Author):

Wanted to highlight this behavior - we're avoiding woodwork inference and just using whatever type we downcast to for y. This essentially matches the current behavior in that Boolean columns can end up as Doubles, with floating point values other than 0 and 1 imputed for nans, which isn't great but isn't reachable from AutoML search.


 if self._forwards_cols is not None:
-    X_forward = X.ww[self._forwards_cols]
+    X_forward = X[self._forwards_cols]
tamargrey (Contributor, Author):

There's no need to stay in woodwork here since we're passing these into pandas methods that won't maintain the types. (In theory doing imputed.ww.pad does actually maintain the types, but since there's no way to batch __setitem__ to update a subset of a dataframe with new, woodwork typed data, we might as well just do a single init at the end rather than iterating over the columns and updating each individually 😢)

# --> this will lose the age ltype and I'm okay with that in this edge case
new_ltypes = {
col: Double
if isinstance(ltype, (IntegerNullable, AgeNullable))
tamargrey (Contributor, Author):

This will lose the AgeNullable logical type info, turning it into Double, but we don't use the Age logical types a whole lot, and this is one small case where the extra logic didn't feel worth it. We can always add it in if we need it.

X, y = X_y_binary
X.ww.set_types({0: "Categorical", 1: "Boolean"})
tamargrey (Contributor, Author):

Setting a float64 column to bool turns all the nonzero values to True, but pandas errors if you try to cast it to the nullable boolean dtype, so I had to just add a new boolean column.
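The casting behavior described here in a small pandas sketch (the exact exception type and message may vary by pandas version):

```python
import pandas as pd

s = pd.Series([0.0, 1.5, 2.0])

# astype(bool) coerces every nonzero value to True
assert list(s.astype(bool)) == [False, True, True]

# astype("boolean") (the nullable dtype) refuses floats other than 0 and 1
try:
    s.astype("boolean")
    raised = False
except (TypeError, ValueError):
    raised = True
assert raised
```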

y_encoded = self._encode_labels(y)
# --> why wasn't this here before? is it a problem to add
if y is not None:
y = infer_feature_types(y)
tamargrey (Contributor, Author):

We weren't initializing woodwork for y if it wasn't already set, which is problematic for _handle_nullable_types, which would init woodwork with type inference. That might not end up all that different from what infer_feature_types does, but I figure we should use the common util when we can.

X_encoded = downcast_int_nullable_to_double(X_encoded)
self._component_obj.fit(X_encoded, y)
# --> don't love calling this after we encode categories - try putting it first?
X_d, y_d = self._handle_nullable_types(X_encoded, y)
@tamargrey tamargrey commented Mar 7, 2023:

_encode_categories calls infer_feature_types for X in the lightgbm estimators, and also has logic related to categorical columns in X that we wouldn't want to apply to any BooleanNullable columns we convert to Categorical as part of downcasting, so I think it makes sense to have the handle call occur after _encode_categories, but ideally we'd have the handle call before doing any other work.

for null_col in X.ww.select(IntegerNullable).columns
},
)
X.ww.init(schema=X.ww.schema)
@tamargrey tamargrey commented Mar 7, 2023:

This is nullable type handling that we can now remove (the undersampler doesn't need nullable type handling).

Though it's worth noting that because the astype calls above happen outside of woodwork, X had no ww types at this point, so this init call was actually run with schema=None and we were just rerunning woodwork inference, which could've resulted in different oversamplers being chosen if inference changed which columns were Categorical (not super likely, which is why it never came up, I'm guessing).

So now we get to avoid the extra woodwork inference as well :)

Reviewer (Contributor):

Without this section, the whole _prepare_data function is just making sure y isn't None and calling infer_feature_types. This feels unnecessary, other components just do that directly. Can we remove this function?

tamargrey (Contributor, Author):

removed! fd06f59

@tamargrey tamargrey marked this pull request as ready for review March 7, 2023 20:49
# Incompatibility is only when the sampler is actually called in transform,
# so we don't need to handle nullable types at fit
oversampler.fit(X, y)
assert not mock_handle_nullable_types.called
tamargrey (Contributor, Author):

Want to highlight that for many of the components with incompatibilities, it's only in one of the fit/predict/transform methods, so I've chosen to not call _handle_nullable_types when it's not needed, which matches behavior I've seen used in the past of only making incompatibility-related changes in the exact context they're needed.

Let me know if anyone has a problem with this, though. I made sure that none of the components have any logic that would change how the component is fit based on the downcasting of types to handle incompatibilities, but I may be missing something.

@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch from 39e2826 to da6e3da Compare March 17, 2023 14:19
@tamargrey tamargrey merged commit 9541e99 into component-specific-nullable-types-handling Mar 17, 2023
@tamargrey tamargrey deleted the use-nullable-type-handling branch March 17, 2023 20:21
tamargrey added a commit that referenced this pull request Mar 20, 2023
… methods (#4046)

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add handle call for lgbm regressor and remove existing handling

* Add handle call for lgbm classifier

* temp broken exp smoothing tests

* lint fix

* add release note

* Fix broken tests by initting woodwork on y in lgbm classifier

* Update tests

* Call handle in arima

* call handle from ts imputer y ltype is downcasted value

* remove unnecessary comments

* Fix time series guide

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Add y ww init to ts imputer back in to fix tests

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* use X_d in lgbm predict proba

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* use common util to determine type for time series imputed integers

* Add comments around why we copy X

* remove _prepare_data from samplers

* PR comments

* remove tests to check if handle method is called

* remove nullable types from imputed data because of regularizer

* fix typo

* fix docstrings

* fix codecov issues

* PR comments

* Revert "Fix time series guide"

This reverts commit 964622a.

* return unchanged ltype in nullable type utils

* add back ts imputer incompatibility test

* use dict get return value

* call handle nullable types in oversampler and check schema equality
tamargrey added a commit that referenced this pull request Mar 20, 2023
tamargrey added a commit that referenced this pull request Mar 22, 2023
tamargrey added a commit that referenced this pull request Mar 24, 2023
tamargrey added a commit that referenced this pull request Mar 27, 2023
tamargrey added a commit that referenced this pull request Mar 29, 2023
* initial commit to make the PR

* Refactor imputer components to remove unnecessary logic  (#4038)

* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featurizer test

* Revert changes to transform prim components for testing

* Revert "Revert changes to transform prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute type in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Convert all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture

* Use nullable type handling in components' fit, transform, and predict methods (#4046)

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add handle call for lgbm regressor and remove existing handling

* Add handle call for lgbm classifier

* temp broken exp smoothing tests

* lint fix

* add release note

* Fix broken tests by initting woodwork on y in lgbm classifier

* Update tests

* Call handle in arima

* call handle from ts imputer when y's ltype is the downcasted value

* remove unnecessary comments

* Fix time series guide

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Add y ww init to ts imputer back in to fix tests

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* use X_d in lgbm predict proba

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* use common util to determine type for time series imputed integers

* Add comments around why we copy X

* remove _prepare_data from samplers

* PR comments

* remove tests to check if handle method is called

* remove nullable types from imputed data because of regularizer

* fix typo

* fix docstrings

* fix codecov issues

* PR comments

* Revert "Fix time series guide"

This reverts commit 964622a.

* return unchanged ltype in nullable type utils

* add back ts imputer incompatibility test

* use dict get return value

* call handle nullable types in oversampler and check schema equality

* Remove other nullable handling across AutoMLSearch (#4085)

* Update tests

* remove unnecessary comments

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* Handle new oversampler nullable type incompatibility and add tests

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add comments to self

* Add handle call for lgbm classifier

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* Remove downcast call from imputer.py and fix tests

* remove downcast call from catboost regressor

* stop adding ohe if bool null present

* remove nullable type handling from knn imputer

* remove nullable type handling from target imputer

* use util for ltype deciding in simple imputer

* Fix broken target imputer test

* Remove replace nullable types transformer from automl search and fix tests

* Handle nullable types in lgbm classifier predict_proba

* fix after rebase

* move util to imputer utils and other fixes

* remove duplicate util

* Change how imputer reinits woodwork types

* Expand tests

* use automl test env

* fix broken test

* Clean up

* Add release note

* Further clean up and note incompatibility

* Continue clean up

* fix failing docstring

* remove remaining comments

* fix invalid target data check docstring

* let knn and simple imputers use same flow of logic for readability

* Move _get_new_logical_types_for_imputed_data to nullable type utils file

* PR comment

* Handle decomposer incompatibility (#4105)

* Handle decomposer incompatibility

* fix comment

* Stop scaling the values up

* Add release note

* PR Comments

* Create incompatibility for testing more similarly to how it shows up in the decomposer

* Add release note

* lint fix

* pr notes

* fix after rebase

---------

Co-authored-by: chukarsten <64713315+chukarsten@users.noreply.github.com>
@chukarsten chukarsten mentioned this pull request Apr 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment