Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove other nullable handling across AutoMLSearch #4085

Conversation

tamargrey
Copy link
Contributor

@tamargrey tamargrey commented Mar 17, 2023

closes #3994, closes #3999, closes #4000

@tamargrey tamargrey changed the base branch from main to component-specific-nullable-types-handling March 17, 2023 20:32
@codecov
Copy link

codecov bot commented Mar 17, 2023

Codecov Report

Merging #4085 (8599f7a) into component-specific-nullable-types-handling (63f12da) will increase coverage by 15.3%.
The diff coverage is 100.0%.

@@                              Coverage Diff                              @@
##           component-specific-nullable-types-handling   #4085      +/-   ##
=============================================================================
+ Coverage                                        84.5%   99.7%   +15.3%     
=============================================================================
  Files                                             349     349              
  Lines                                           37579   37607      +28     
=============================================================================
+ Hits                                            31748   37490    +5742     
+ Misses                                           5831     117    -5714     
Impacted Files Coverage Δ
evalml/data_checks/invalid_target_data_check.py 100.0% <ø> (ø)
...alml/data_checks/target_distribution_data_check.py 100.0% <ø> (ø)
...onents/estimators/regressors/catboost_regressor.py 100.0% <ø> (ø)
...tors/regressors/exponential_smoothing_regressor.py 100.0% <ø> (ø)
...nents/transformers/imputers/time_series_imputer.py 100.0% <ø> (ø)
evalml/pipelines/utils.py 99.6% <ø> (+1.8%) ⬆️
...valml/tests/automl_tests/test_default_algorithm.py 100.0% <ø> (+100.0%) ⬆️
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <ø> (ø)
...hecks_tests/test_target_distribution_data_check.py 100.0% <ø> (ø)
..._tests/test_data_checks_and_actions_integration.py 100.0% <ø> (ø)
... and 18 more

... and 32 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

bool_ltype = "BooleanNullable" if use_nullable else "Boolean"
X.ww.set_types({0: "Categorical"})
X.ww[len(X.columns)] = ww.init_series(
pd.Series([True, False] * 50),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is trying to artificially add Categorical and Boolean columns to X_y_binary which is all float columns. The Boolean transformation works, but BooleanNullable does not allow converting from numeric values that aren't 1s or 0s. So I changed this test to set one of the float columns to be Categorical and add in new Boolean column of actual boolean values. Similar changes had to be made to test_smotenc_boolean_numeric for the same reasons.

@@ -231,7 +231,7 @@ def test_data_checks_impute_cols(problem_type):
expected_pipeline_class = MulticlassClassificationPipeline
y_expected = ww.init_series(
pd.Series([0, 1, 2, 2, 2]),
logical_type="Integer",
logical_type="IntegerNullable",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These testing changes are due to the target imputer changes to maintain original logical type if possible

@@ -2746,6 +2746,6 @@ def test_get_component_input_logical_types():
no_estimator.fit(X, y)
assert no_estimator.last_component_input_logical_types == {
"cat": Categorical(),
"numeric": Double(),
"numeric": Integer(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a side effect of the change to Imputer to keep logical types where possible

"X_has_nans",
[True, False],
)
def test_partial_dependence_with_nullable_types(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this test because mquantiles, which we use in _grid_from_X, has a nullable type incompatibility when null values are present in Int64 or boolean data.

This isn't actually going to be problematic, though, because the incompatibility is only when null values are present in the data, and remove any nans prior to passing the data into mquantiles https://github.com/alteryx/evalml/blob/main/evalml/model_understanding/_partial_dependence_utils.py#L178-L185. But I wanted to add this test in case we ever changed how we use mquantiles.

# Numeric imputing may have changed logical types because of use of _get_new_logical_types_for_imputed_data
# --> this is only needed because of test_imputer_does_not_erase_ww_info test - otherwise we always assume ww types would be present
if imputed.ww.schema is None:
imputed.ww.init()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would be that imputers called by Imputer should all handle the woodwork initialization and type changes needed according to the specific impute behavior. However, the test test_imputer_does_not_erase_ww_info, from what I can tell, is specifically testing that if the imputers called by Imputer don't initialize woodwork, that the Imputer component still will and will maintain types.

So I added this woodwork init to handle that case to honor the spirit of that test. But I'd be happy to remove the initalization logic here and change or remove test_imputer_does_not_erase_ww_info if people agree that it doesn't feel necessary to keep.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The does_not_erase_ww_info tests came from an error where some of our components would drop typing information, so we had to waste time reinferring and we potentially lost typing information. It would make sense to try and delegate all typing work to the imputer subclasses, but I think it's safer in the long run to keep ensuring typing info here as well. It'd be great if we could initialize more intelligently to avoid inference where possible, but I'm not sure if that's doable here.

A related clarifying question, we don't need to do anything with boolean columns here because they won't need to change like int->float does and we have all the boolean nullable handling we need in other components, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it - so because we're doing a ww.init() call here, it does open up the possibility of reinferring and losing types if the simple imputer stopped maintaining the woodwork schema, but, like you say, there's not a great way to avoid this, and at least it would just be the numeric columns. The good thing is that we don't expect to actually need to init woodwork here, so we're okay leaving it as is.

And to answer your question: that is correct. We can fully maintain all the non integer types, and that allows us to not have any similar logic for any of the other types of columns.

)
# --> should probably use the automl test env but im not sure if its okay to have the score return value be set?
env = AutoMLTestEnv(problem_type)
with env.test_context(score_return_value={automl.objective.name: 1.0}):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to double check that using the automl test env this way doesnt defeat the purpose of this test. I figured we wanted to use it otherwise these tests take forever.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be good :)

assert (
combined_pipeline.component_graph.get_inputs("First Pipeline - Imputer")[0]
== "First Pipeline - Replace Nullable Types Transformer.x"
== "DFS Transformer.x"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% this is the right check, but it seems like the main purpose was to confirm the order of components was correct, for which the DFS Transformer could be used now that the replace nullable types transformer is gone.

)
assert order.index(
f"{dtype} Pipeline - Replace Nullable Types Transformer",
) < order.index(f"{dtype} Pipeline - Imputer")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, this is testing that in categorical and numeric split pipelines, the replace nullable types transformer comes after the label encoder. So now that replace nullable types transformer is gone, there's no need for such a check? If the goal of the test was instead about the placement of the label encoder, I can add in a different check, but I feel like the assert order[0] == "Label Encoder" if not features else "DFS Transformer" check above should cover that..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this was a recent patch on my end, I should have left a comment. The goal of this test is to ensure that splitting the pipeline doesn't mess up the computation order. Some components need to be computed before others, so we need to ensure that that still occurs.

The original test was asserting the exact computation order of all the components, which changed when I upgraded our networkx version. I changed it to focus just on the "what components need to be computed before other components" part, relaxing the strict order requirement. We should still keep this test to some degree, but we can skip the Replace Nullable Types Transformer part of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, so I'll keep this as is with the assert order[0] == "Label Encoder" if not features else "DFS Transformer" check in this test maintaining coverage of the expected order of things!

@tamargrey tamargrey marked this pull request as ready for review March 20, 2023 18:12
Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, just a few small things!

@@ -95,23 +98,18 @@ def transform(self, X, y=None):
X_t = self._component_obj.transform(X[self._cols_to_impute])
X_t = pd.DataFrame(X_t, columns=self._cols_to_impute)

# Get Woodwork types for the imputed data
new_schema = original_schema.get_subset_schema(self._cols_to_impute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a specific reason you decided to get rid of new_schema here but keep it in the simple imputer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No specific reason. I'm now realizing how similar the knn and simple imputers' transform methods really are. Really the only difference is the bool logic in the simple imputer to return early if needed.

I can refactor them to follow the same flow of logic and have more similar variable names so that it's easier to read through them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in 7057482 - this uncovered an undesired behavior when using only original_schema to reinit woodwork when fully null columns were present, so I stuck with the behavior of defining the new schema as its own variable but changed its name to be imputed_schema to hopefully be slightly more clear than new_schema was

# Numeric imputing may have changed logical types because of use of _get_new_logical_types_for_imputed_data
# --> this is only needed because of test_imputer_does_not_erase_ww_info test - otherwise we always assume ww types would be present
if imputed.ww.schema is None:
imputed.ww.init()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The does_not_erase_ww_info tests came from an error where some of our components would drop typing information, so we had to waste time reinferring and we potentially lost typing information. It would make sense to try and delegate all typing work to the imputer subclasses, but I think it's safer in the long run to keep ensuring typing info here as well. It'd be great if we could initialize more intelligently to avoid inference where possible, but I'm not sure if that's doable here.

A related clarifying question, we don't need to do anything with boolean columns here because they won't need to change like int->float does and we have all the boolean nullable handling we need in other components, right?

from evalml.utils.nullable_type_utils import _determine_fractional_type


def _get_new_logical_types_for_imputed_data(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like pulling all the typing into a single function here, all the imputers are so much easier to read now.

It feels a little odd to add a whole new file just for one function. I really wanted to be able to add this as a static method to a parent imputer class that they all inherit from, but we don't have that set up at this point. I also like to avoid calling private functions outside of the file where they were defined. What are your thoughts on moving this to nullable_types_utils? This function seems like it fits under the nullable util umbrella.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's originally where I had it--I can never decide where to put things! I'll happily move it back there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done - af45447

)
assert order.index(
f"{dtype} Pipeline - Replace Nullable Types Transformer",
) < order.index(f"{dtype} Pipeline - Imputer")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this was a recent patch on my end, I should have left a comment. The goal of this test is to ensure that splitting the pipeline doesn't mess up the computation order. Some components need to be computed before others, so we need to ensure that that still occurs.

The original test was asserting the exact computation order of all the components, which changed when I upgraded our networkx version. I changed it to focus just on the "what components need to be computed before other components" part, relaxing the strict order requirement. We should still keep this test to some degree, but we can skip the Replace Nullable Types Transformer part of it.

Comment on lines -68 to +71
assert all([ReplaceNullableTypes.name in pl for pl in pipelines])

# Confirm ReplaceNullableTypes transformer isn't used
assert not any([ReplaceNullableTypes.name in pl for pl in pipelines])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@tamargrey tamargrey requested a review from eccabay March 21, 2023 20:27
@tamargrey tamargrey force-pushed the component-specific-nullable-types-handling branch from 07b89dc to 63f12da Compare March 22, 2023 14:42
@tamargrey tamargrey force-pushed the remove-other-nullable-handling branch from 0ee6b8f to 8599f7a Compare March 22, 2023 14:43
@@ -603,6 +607,8 @@ def test_imputer_int_preserved():
0: Double,
}

# If no null values need to be imputed, integer column should be maintained
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate these comments - good practice.

@@ -34,7 +34,7 @@ def test_knn_imputer_ignores_natural_language(
imputer_test_data,
df_composition,
):
"""Test to ensure that the simple imputer just passes through
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can scold @MichaelFu512 for this when he comes back :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol I have no feet to stand on -- I did the exact same thing and @eccabay had to correct me!

)
# --> should probably use the automl test env but im not sure if its okay to have the score return value be set?
env = AutoMLTestEnv(problem_type)
with env.test_context(score_return_value={automl.objective.name: 1.0}):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be good :)

@tamargrey tamargrey merged commit 75a4714 into component-specific-nullable-types-handling Mar 23, 2023
@tamargrey tamargrey deleted the remove-other-nullable-handling branch March 23, 2023 13:48
tamargrey added a commit that referenced this pull request Mar 24, 2023
* Update tests

* remove unnecessary comments

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* Handle new oversampler nullable type incompatibility and add tests

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add comments to self

* Add handle call for lgbm classifier

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* Remove downcast call from imputer.py and fix tests

* remove downcast call from catboost regressor

* stop adding ohe if bool null present

* remove nullable type handling from knn imputer

* remove nullable type handling from targer imputer

* use util for ltype deciding in simple imputer

* Fix broken target imputer test

* Remove replace nullable types transformer from automl search and fix tests

* Handle nullable types in lgbm classifier predict_proba

* fix after rebase

* move util to imputer utils and other fixes

* remove duplicate util

* Change how imputer reinits woodwork types

* Expand tests

* use automl test env

* fix broken test

* Clean up

* Add release note

* Further clean up and note incompatibility

* Continue clean up

* fix failing docstring

* remove remaining comments

* fix invali target data check docstring

* let knn and simple imputers use same flow of logic for readability

* Move _get_new_logical_types_for_imputed_data to nullable type utils file

* PR comment
tamargrey added a commit that referenced this pull request Mar 27, 2023
* Update tests

* remove unnecessary comments

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* Handle new oversampler nullable type incompatibility and add tests

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add comments to self

* Add handle call for lgbm classifier

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* Remove downcast call from imputer.py and fix tests

* remove downcast call from catboost regressor

* stop adding ohe if bool null present

* remove nullable type handling from knn imputer

* remove nullable type handling from targer imputer

* use util for ltype deciding in simple imputer

* Fix broken target imputer test

* Remove replace nullable types transformer from automl search and fix tests

* Handle nullable types in lgbm classifier predict_proba

* fix after rebase

* move util to imputer utils and other fixes

* remove duplicate util

* Change how imputer reinits woodwork types

* Expand tests

* use automl test env

* fix broken test

* Clean up

* Add release note

* Further clean up and note incompatibility

* Continue clean up

* fix failing docstring

* remove remaining comments

* fix invali target data check docstring

* let knn and simple imputers use same flow of logic for readability

* Move _get_new_logical_types_for_imputed_data to nullable type utils file

* PR comment
tamargrey added a commit that referenced this pull request Mar 29, 2023
* initial commit to make the PR

* Refactor imputer components to remove unnecessary logic  (#4038)

* Stop using woodwork describe to get nan info in time series imputer

* remove logic that's no longer needed and return if all bool dtype

* remove unnecessary logic from target imputer

* remove unused utils

* remove logic to convert dfs features to categorical logical type

* fix email featureizer test

* Revert changes to transfomr prim components for testing

* Revert "Revert changes to transfomr prim components for testing"

This reverts commit 57dda43.

* Fix bugs with ww not imputed and string categories

* Add release note

* Handle case where nans are present at transform in previously all bool dtype data

* Stop truncating ints in target imputer

* clean up

* Fix tests

* Keep ltype integer for most frequent impute tpye in target imputer

* refactor knn imputer to use new logic

* Fix list bug

* remove comment

* Update release note to mention nullable types

* remove outdated comment

* Conver all bool dfs to boolean nullable instead of refitting

* lint fix

* PR comments

* add second bool col to imputer fixture

* Use nullable type handling in components' fit, transform, and predict methods (#4046)

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add handle call for lgbm regressor and remove existing handling

* Add handle call for lgbm classifier

* temp broken exp smoothing tests

* lint fix

* add release note

* Fix broken tests by initting woodwork on y in lgbm classifier

* Update tests

* Call handle in arima

* call handle from ts imputer y ltype is downcasted value

* remove unnecessary comments

* Fix time series guide

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Add y ww init to ts imputer back in to fix tests

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* use X_d in lgbm predict proba

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* use common util to determine type for time series imputed integers

* Add comments around why we copy X

* remove _prepare_data from samplers

* PR comments

* remove tests to check if handle method is called

* remove nullable types from imputed data because of regularizer

* fix typo

* fix docstrings

* fix codecov issues

* PR comments

* Revert "Fix time series guide"

This reverts commit 964622a.

* return unchanged ltype in nullabl;e type utils

* add back ts imputer incompatibility test

* use dict get return value

* call handle nullable types in oversampler and check schema equality

* Remove other nullable handling across AutoMLSearch (#4085)

* Update tests

* remove unnecessary comments

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* Handle new oversampler nullable type incompatibility and add tests

* Remove existing nullable type handling from oversampler and use _handle_nullable_types instead

* Add comments to self

* Add handle call for lgbm classifier

* lint fix

* Only call handle_nullable_types when necessary in methods

* Remove remaining unnecessary handle calls

* resolve remaining comments

* Copy X in testing nullable types to stop hiding potential incompatibilities in methods

* remove nullable type handling after sklearn upgrade fixed incompatibilities

* Remove downcast call from imputer.py and fix tests

* remove downcast call from catboost regressor

* stop adding ohe if bool null present

* remove nullable type handling from knn imputer

* remove nullable type handling from targer imputer

* use util for ltype deciding in simple imputer

* Fix broken target imputer test

* Remove replace nullable types transformer from automl search and fix tests

* Handle nullable types in lgbm classifier predict_proba

* fix after rebase

* move util to imputer utils and other fixes

* remove duplicate util

* Change how imputer reinits woodwork types

* Expand tests

* use automl test env

* fix broken test

* Clean up

* Add release note

* Further clean up and note incompatibility

* Continue clean up

* fix failing docstring

* remove remaining comments

* fix invali target data check docstring

* let knn and simple imputers use same flow of logic for readability

* Move _get_new_logical_types_for_imputed_data to nullable type utils file

* PR comment

* Handle decomposer incompatibility (#4105)

* Handle decomposer incompatibility

* fix comment

* Stop scaling the values up

* Add release note

* PR Comments

* Create incompatibility for testing more similarly to how it shows up in the decomposer

* Add release note

* lint fix

* pr notes

* fix after rebase

---------

Co-authored-by: chukarsten <64713315+chukarsten@users.noreply.github.com>
@chukarsten chukarsten mentioned this pull request Apr 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants