Remove other nullable handling across AutoMLSearch #4085

tamargrey · 2023-03-17T20:32:07Z

closes #3994, closes #3999, closes #4000

codecov · 2023-03-17T20:40:38Z

Codecov Report

Merging #4085 (8599f7a) into component-specific-nullable-types-handling (63f12da) will increase coverage by 15.3%.
The diff coverage is 100.0%.

@@                              Coverage Diff                              @@
##           component-specific-nullable-types-handling   #4085      +/-   ##
=============================================================================
+ Coverage                                        84.5%   99.7%   +15.3%     
=============================================================================
  Files                                             349     349              
  Lines                                           37579   37607      +28     
=============================================================================
+ Hits                                            31748   37490    +5742     
+ Misses                                           5831     117    -5714

Impacted Files	Coverage Δ
evalml/data_checks/invalid_target_data_check.py	`100.0% <ø> (ø)`
...alml/data_checks/target_distribution_data_check.py	`100.0% <ø> (ø)`
...onents/estimators/regressors/catboost_regressor.py	`100.0% <ø> (ø)`
...tors/regressors/exponential_smoothing_regressor.py	`100.0% <ø> (ø)`
...nents/transformers/imputers/time_series_imputer.py	`100.0% <ø> (ø)`
evalml/pipelines/utils.py	`99.6% <ø> (+1.8%)`	⬆️
...valml/tests/automl_tests/test_default_algorithm.py	`100.0% <ø> (+100.0%)`	⬆️
evalml/tests/data_checks_tests/test_data_checks.py	`100.0% <ø> (ø)`
...hecks_tests/test_target_distribution_data_check.py	`100.0% <ø> (ø)`
..._tests/test_data_checks_and_actions_integration.py	`100.0% <ø> (ø)`
... and 18 more

... and 32 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

tamargrey · 2023-03-20T17:12:48Z

evalml/tests/component_tests/test_oversampler.py

+    bool_ltype = "BooleanNullable" if use_nullable else "Boolean"
+    X.ww.set_types({0: "Categorical"})
+    X.ww[len(X.columns)] = ww.init_series(
+        pd.Series([True, False] * 50),


This test is trying to artificially add Categorical and Boolean columns to X_y_binary which is all float columns. The Boolean transformation works, but BooleanNullable does not allow converting from numeric values that aren't 1s or 0s. So I changed this test to set one of the float columns to be Categorical and add in new Boolean column of actual boolean values. Similar changes had to be made to test_smotenc_boolean_numeric for the same reasons.

tamargrey · 2023-03-20T17:16:31Z

evalml/tests/integration_tests/test_data_checks_and_actions_integration.py

@@ -231,7 +231,7 @@ def test_data_checks_impute_cols(problem_type):
        expected_pipeline_class = MulticlassClassificationPipeline
        y_expected = ww.init_series(
            pd.Series([0, 1, 2, 2, 2]),
-            logical_type="Integer",
+            logical_type="IntegerNullable",


These testing changes are due to the target imputer changes to maintain original logical type if possible

tamargrey · 2023-03-20T17:28:26Z

evalml/tests/pipeline_tests/test_component_graph.py

@@ -2746,6 +2746,6 @@ def test_get_component_input_logical_types():
    no_estimator.fit(X, y)
    assert no_estimator.last_component_input_logical_types == {
        "cat": Categorical(),
-        "numeric": Double(),
+        "numeric": Integer(),


This is a side effect of the change to Imputer to keep logical types where possible

tamargrey · 2023-03-20T17:31:15Z

evalml/tests/model_understanding_tests/test_partial_dependence.py

+    "X_has_nans",
+    [True, False],
+)
+def test_partial_dependence_with_nullable_types(


Added this test because mquantiles, which we use in _grid_from_X, has a nullable type incompatibility when null values are present in Int64 or boolean data.

This isn't actually going to be problematic, though, because the incompatibility is only when null values are present in the data, and remove any nans prior to passing the data into mquantiles https://github.com/alteryx/evalml/blob/main/evalml/model_understanding/_partial_dependence_utils.py#L178-L185. But I wanted to add this test in case we ever changed how we use mquantiles.

tamargrey · 2023-03-20T17:49:36Z

evalml/pipelines/components/transformers/imputers/imputer.py

+            # Numeric imputing may have changed logical types because of use of _get_new_logical_types_for_imputed_data
+            # --> this is only needed because of test_imputer_does_not_erase_ww_info test - otherwise we always assume ww types would be present
+            if imputed.ww.schema is None:
+                imputed.ww.init()


My preference would be that imputers called by Imputer should all handle the woodwork initialization and type changes needed according to the specific impute behavior. However, the test test_imputer_does_not_erase_ww_info, from what I can tell, is specifically testing that if the imputers called by Imputer don't initialize woodwork, that the Imputer component still will and will maintain types.

So I added this woodwork init to handle that case to honor the spirit of that test. But I'd be happy to remove the initalization logic here and change or remove test_imputer_does_not_erase_ww_info if people agree that it doesn't feel necessary to keep.

The does_not_erase_ww_info tests came from an error where some of our components would drop typing information, so we had to waste time reinferring and we potentially lost typing information. It would make sense to try and delegate all typing work to the imputer subclasses, but I think it's safer in the long run to keep ensuring typing info here as well. It'd be great if we could initialize more intelligently to avoid inference where possible, but I'm not sure if that's doable here.

A related clarifying question, we don't need to do anything with boolean columns here because they won't need to change like int->float does and we have all the boolean nullable handling we need in other components, right?

Got it - so because we're doing a ww.init() call here, it does open up the possibility of reinferring and losing types if the simple imputer stopped maintaining the woodwork schema, but, like you say, there's not a great way to avoid this, and at least it would just be the numeric columns. The good thing is that we don't expect to actually need to init woodwork here, so we're okay leaving it as is.

And to answer your question: that is correct. We can fully maintain all the non integer types, and that allows us to not have any similar logic for any of the other types of columns.

tamargrey · 2023-03-20T17:52:36Z

evalml/tests/integration_tests/test_nullable_types.py

+    )
+    # --> should probably use the automl test env but im not sure if its okay to have the score return value be set?
+    env = AutoMLTestEnv(problem_type)
+    with env.test_context(score_return_value={automl.objective.name: 1.0}):


I just want to double check that using the automl test env this way doesnt defeat the purpose of this test. I figured we wanted to use it otherwise these tests take forever.

I think you should be good :)

tamargrey · 2023-03-20T17:54:41Z

evalml/tests/pipeline_tests/test_pipeline_utils.py

    assert (
        combined_pipeline.component_graph.get_inputs("First Pipeline - Imputer")[0]
-        == "First Pipeline - Replace Nullable Types Transformer.x"
+        == "DFS Transformer.x"


Not 100% this is the right check, but it seems like the main purpose was to confirm the order of components was correct, for which the DFS Transformer could be used now that the replace nullable types transformer is gone.

tamargrey · 2023-03-20T17:57:23Z

evalml/tests/automl_tests/test_default_algorithm.py

-        )
-        assert order.index(
-            f"{dtype} Pipeline - Replace Nullable Types Transformer",
-        ) < order.index(f"{dtype} Pipeline - Imputer")


As far as I can tell, this is testing that in categorical and numeric split pipelines, the replace nullable types transformer comes after the label encoder. So now that replace nullable types transformer is gone, there's no need for such a check? If the goal of the test was instead about the placement of the label encoder, I can add in a different check, but I feel like the assert order[0] == "Label Encoder" if not features else "DFS Transformer" check above should cover that..

Ah, this was a recent patch on my end, I should have left a comment. The goal of this test is to ensure that splitting the pipeline doesn't mess up the computation order. Some components need to be computed before others, so we need to ensure that that still occurs.

The original test was asserting the exact computation order of all the components, which changed when I upgraded our networkx version. I changed it to focus just on the "what components need to be computed before other components" part, relaxing the strict order requirement. We should still keep this test to some degree, but we can skip the Replace Nullable Types Transformer part of it.

Got it, so I'll keep this as is with the assert order[0] == "Label Encoder" if not features else "DFS Transformer" check in this test maintaining coverage of the expected order of things!

eccabay

Looks pretty good, just a few small things!

eccabay · 2023-03-21T13:05:29Z

evalml/pipelines/components/transformers/imputers/knn_imputer.py

@@ -95,23 +98,18 @@ def transform(self, X, y=None):
        X_t = self._component_obj.transform(X[self._cols_to_impute])
        X_t = pd.DataFrame(X_t, columns=self._cols_to_impute)

-        # Get Woodwork types for the imputed data
-        new_schema = original_schema.get_subset_schema(self._cols_to_impute)


Was there a specific reason you decided to get rid of new_schema here but keep it in the simple imputer?

No specific reason. I'm now realizing how similar the knn and simple imputers' transform methods really are. Really the only difference is the bool logic in the simple imputer to return early if needed.

I can refactor them to follow the same flow of logic and have more similar variable names so that it's easier to read through them.

Implemented in 7057482 - this uncovered an undesired behavior when using only original_schema to reinit woodwork when fully null columns were present, so I stuck with the behavior of defining the new schema as its own variable but changed its name to be imputed_schema to hopefully be slightly more clear than new_schema was

eccabay · 2023-03-21T13:08:06Z

evalml/pipelines/components/transformers/imputers/imputer.py

+            # Numeric imputing may have changed logical types because of use of _get_new_logical_types_for_imputed_data
+            # --> this is only needed because of test_imputer_does_not_erase_ww_info test - otherwise we always assume ww types would be present
+            if imputed.ww.schema is None:
+                imputed.ww.init()


The does_not_erase_ww_info tests came from an error where some of our components would drop typing information, so we had to waste time reinferring and we potentially lost typing information. It would make sense to try and delegate all typing work to the imputer subclasses, but I think it's safer in the long run to keep ensuring typing info here as well. It'd be great if we could initialize more intelligently to avoid inference where possible, but I'm not sure if that's doable here.

A related clarifying question, we don't need to do anything with boolean columns here because they won't need to change like int->float does and we have all the boolean nullable handling we need in other components, right?

eccabay · 2023-03-21T13:19:28Z

evalml/pipelines/components/transformers/imputers/utils.py

+from evalml.utils.nullable_type_utils import _determine_fractional_type
+
+
+def _get_new_logical_types_for_imputed_data(


I really like pulling all the typing into a single function here, all the imputers are so much easier to read now.

It feels a little odd to add a whole new file just for one function. I really wanted to be able to add this as a static method to a parent imputer class that they all inherit from, but we don't have that set up at this point. I also like to avoid calling private functions outside of the file where they were defined. What are your thoughts on moving this to nullable_types_utils? This function seems like it fits under the nullable util umbrella.

that's originally where I had it--I can never decide where to put things! I'll happily move it back there

done - af45447

eccabay · 2023-03-21T13:53:01Z

evalml/tests/automl_tests/test_default_algorithm.py

-        )
-        assert order.index(
-            f"{dtype} Pipeline - Replace Nullable Types Transformer",
-        ) < order.index(f"{dtype} Pipeline - Imputer")


Ah, this was a recent patch on my end, I should have left a comment. The goal of this test is to ensure that splitting the pipeline doesn't mess up the computation order. Some components need to be computed before others, so we need to ensure that that still occurs.

The original test was asserting the exact computation order of all the components, which changed when I upgraded our networkx version. I changed it to focus just on the "what components need to be computed before other components" part, relaxing the strict order requirement. We should still keep this test to some degree, but we can skip the Replace Nullable Types Transformer part of it.

eccabay · 2023-03-21T13:58:57Z

evalml/tests/integration_tests/test_nullable_types.py

-    assert all([ReplaceNullableTypes.name in pl for pl in pipelines])
+
+    # Confirm ReplaceNullableTypes transformer isn't used
+    assert not any([ReplaceNullableTypes.name in pl for pl in pipelines])


evalml/tests/component_tests/test_knn_imputer.py

…lities in methods

…le_nullable_types instead

…tests

chukarsten · 2023-03-22T17:29:33Z

evalml/tests/component_tests/test_imputer.py

@@ -603,6 +607,8 @@ def test_imputer_int_preserved():
        0: Double,
    }

+    # If no null values need to be imputed, integer column should be maintained


Appreciate these comments - good practice.

chukarsten · 2023-03-22T17:30:31Z

evalml/tests/component_tests/test_knn_imputer.py

@@ -34,7 +34,7 @@ def test_knn_imputer_ignores_natural_language(
    imputer_test_data,
    df_composition,
 ):
-    """Test to ensure that the simple imputer just passes through


You can scold @MichaelFu512 for this when he comes back :)

lol I have no feet to stand on -- I did the exact same thing and @eccabay had to correct me!

chukarsten · 2023-03-22T18:48:51Z

evalml/tests/integration_tests/test_nullable_types.py

+    )
+    # --> should probably use the automl test env but im not sure if its okay to have the score return value be set?
+    env = AutoMLTestEnv(problem_type)
+    with env.test_context(score_return_value={automl.objective.name: 1.0}):


I think you should be good :)

* Update tests * remove unnecessary comments * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * Handle new oversampler nullable type incompatibility and add tests * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add comments to self * Add handle call for lgbm classifier * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * remove nullable type handling after sklearn upgrade fixed incompatibilities * Remove downcast call from imputer.py and fix tests * remove downcast call from catboost regressor * stop adding ohe if bool null present * remove nullable type handling from knn imputer * remove nullable type handling from targer imputer * use util for ltype deciding in simple imputer * Fix broken target imputer test * Remove replace nullable types transformer from automl search and fix tests * Handle nullable types in lgbm classifier predict_proba * fix after rebase * move util to imputer utils and other fixes * remove duplicate util * Change how imputer reinits woodwork types * Expand tests * use automl test env * fix broken test * Clean up * Add release note * Further clean up and note incompatibility * Continue clean up * fix failing docstring * remove remaining comments * fix invali target data check docstring * let knn and simple imputers use same flow of logic for readability * Move _get_new_logical_types_for_imputed_data to nullable type utils file * PR comment

* initial commit to make the PR * Refactor imputer components to remove unnecessary logic (#4038) * Stop using woodwork describe to get nan info in time series imputer * remove logic that's no longer needed and return if all bool dtype * remove unnecessary logic from target imputer * remove unused utils * remove logic to convert dfs features to categorical logical type * fix email featureizer test * Revert changes to transfomr prim components for testing * Revert "Revert changes to transfomr prim components for testing" This reverts commit 57dda43. * Fix bugs with ww not imputed and string categories * Add release note * Handle case where nans are present at transform in previously all bool dtype data * Stop truncating ints in target imputer * clean up * Fix tests * Keep ltype integer for most frequent impute tpye in target imputer * refactor knn imputer to use new logic * Fix list bug * remove comment * Update release note to mention nullable types * remove outdated comment * Conver all bool dfs to boolean nullable instead of refitting * lint fix * PR comments * add second bool col to imputer fixture * Use nullable type handling in components' fit, transform, and predict methods (#4046) * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add handle call for lgbm regressor and remove existing handling * Add handle call for lgbm classifier * temp broken exp smoothing tests * lint fix * add release note * Fix broken tests by initting woodwork on y in lgbm classifier * Update tests * Call handle in arima * call handle from ts imputer y ltype is downcasted value * remove unnecessary comments * Fix time series guide * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Add y ww init to ts imputer back in to fix tests * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * use X_d in lgbm predict proba * remove nullable type handling after sklearn upgrade fixed incompatibilities * use common util to determine type for time series imputed integers * Add comments around why we copy X * remove _prepare_data from samplers * PR comments * remove tests to check if handle method is called * remove nullable types from imputed data because of regularizer * fix typo * fix docstrings * fix codecov issues * PR comments * Revert "Fix time series guide" This reverts commit 964622a. * return unchanged ltype in nullabl;e type utils * add back ts imputer incompatibility test * use dict get return value * call handle nullable types in oversampler and check schema equality * Remove other nullable handling across AutoMLSearch (#4085) * Update tests * remove unnecessary comments * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * Handle new oversampler nullable type incompatibility and add tests * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add comments to self * Add handle call for lgbm classifier * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * remove nullable type handling after sklearn upgrade fixed incompatibilities * Remove downcast call from imputer.py and fix tests * remove downcast call from catboost regressor * stop adding ohe if bool null present * remove nullable type handling from knn imputer * remove nullable type handling from targer imputer * use util for ltype deciding in simple imputer * Fix broken target imputer test * Remove replace nullable types transformer from automl search and fix tests * Handle nullable types in lgbm classifier predict_proba * fix after rebase * move util to imputer utils and other fixes * remove duplicate util * Change how imputer reinits woodwork types * Expand tests * use automl test env * fix broken test * Clean up * Add release note * Further clean up and note incompatibility * Continue clean up * fix failing docstring * remove remaining comments * fix invali target data check docstring * let knn and simple imputers use same flow of logic for readability * Move _get_new_logical_types_for_imputed_data to nullable type utils file * PR comment * Handle decomposer incompatibility (#4105) * Handle decomposer incompatibility * fix comment * Stop scaling the values up * Add release note * PR Comments * Create incompatibility for testing more similarly to how it shows up in the decomposer * Add release note * lint fix * pr notes * fix after rebase --------- Co-authored-by: chukarsten <64713315+chukarsten@users.noreply.github.com>

tamargrey changed the base branch from main to component-specific-nullable-types-handling March 17, 2023 20:32

tamargrey mentioned this pull request Mar 17, 2023

[DO NOT MERGE] Remove existing nullable type handlings across automl #4055

Closed

tamargrey force-pushed the remove-other-nullable-handling branch from a020d77 to 3a1b016 Compare March 20, 2023 14:53

tamargrey force-pushed the component-specific-nullable-types-handling branch from 9541e99 to 07b89dc Compare March 20, 2023 16:48

tamargrey force-pushed the remove-other-nullable-handling branch from 3a1b016 to 00f418b Compare March 20, 2023 16:48

tamargrey commented Mar 20, 2023

View reviewed changes

tamargrey marked this pull request as ready for review March 20, 2023 18:12

auto-assign bot assigned tamargrey Mar 20, 2023

tamargrey requested review from eccabay, jeremyliweishih and chukarsten March 20, 2023 19:11

eccabay reviewed Mar 21, 2023

View reviewed changes

tamargrey requested a review from eccabay March 21, 2023 20:27

eccabay approved these changes Mar 21, 2023

View reviewed changes

evalml/tests/component_tests/test_knn_imputer.py Outdated Show resolved Hide resolved

tamargrey force-pushed the component-specific-nullable-types-handling branch from 07b89dc to 63f12da Compare March 22, 2023 14:42

Tamar Grey added 7 commits March 22, 2023 10:43

Update tests

1fe1954

remove unnecessary comments

6299a48

resolve remaining comments

f1f9d23

Copy X in testing nullable types to stop hiding potential incompatibi…

07656d0

…lities in methods

Handle new oversampler nullable type incompatibility and add tests

7589d79

Remove existing nullable type handling from oversampler and use _hand…

82368ab

…le_nullable_types instead

Add comments to self

0e76a12

Tamar Grey added 22 commits March 22, 2023 10:43

remove nullable type handling from targer imputer

ed46700

use util for ltype deciding in simple imputer

a667847

Fix broken target imputer test

44e7e25

Remove replace nullable types transformer from automl search and fix …

123d4c1

…tests

Handle nullable types in lgbm classifier predict_proba

b900812

fix after rebase

edddd7f

move util to imputer utils and other fixes

2d62e59

remove duplicate util

4189a03

Change how imputer reinits woodwork types

f8765ba

Expand tests

df5e2d4

use automl test env

9ead475

fix broken test

ba80fad

Clean up

20b8188

Add release note

770140c

Further clean up and note incompatibility

8f3c4bb

Continue clean up

ecc8cd3

fix failing docstring

fe5c57a

remove remaining comments

374691d

fix invali target data check docstring

fff4613

let knn and simple imputers use same flow of logic for readability

898e18a

Move _get_new_logical_types_for_imputed_data to nullable type utils file

e6fa22d

PR comment

8599f7a

tamargrey force-pushed the remove-other-nullable-handling branch from 0ee6b8f to 8599f7a Compare March 22, 2023 14:43

chukarsten approved these changes Mar 22, 2023

View reviewed changes

tamargrey merged commit 75a4714 into component-specific-nullable-types-handling Mar 23, 2023

tamargrey deleted the remove-other-nullable-handling branch March 23, 2023 13:48

chukarsten mentioned this pull request Apr 3, 2023

Release v0.73.0. #4122

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove other nullable handling across AutoMLSearch #4085

Remove other nullable handling across AutoMLSearch #4085

tamargrey commented Mar 17, 2023 •

edited

Loading

codecov bot commented Mar 17, 2023 •

edited

Loading

tamargrey Mar 20, 2023

tamargrey Mar 20, 2023

tamargrey Mar 20, 2023

tamargrey Mar 20, 2023

tamargrey Mar 20, 2023

eccabay Mar 21, 2023

tamargrey Mar 21, 2023

tamargrey Mar 20, 2023

chukarsten Mar 22, 2023

tamargrey Mar 20, 2023

tamargrey Mar 20, 2023

eccabay Mar 21, 2023

tamargrey Mar 21, 2023

eccabay left a comment

eccabay Mar 21, 2023

tamargrey Mar 21, 2023

tamargrey Mar 21, 2023

eccabay Mar 21, 2023

eccabay Mar 21, 2023

tamargrey Mar 21, 2023

tamargrey Mar 21, 2023

eccabay Mar 21, 2023

eccabay Mar 21, 2023

chukarsten Mar 22, 2023

chukarsten Mar 22, 2023

tamargrey Mar 22, 2023

chukarsten Mar 22, 2023

		from evalml.utils.nullable_type_utils import _determine_fractional_type


		def _get_new_logical_types_for_imputed_data(

Remove other nullable handling across AutoMLSearch #4085

Remove other nullable handling across AutoMLSearch #4085

Conversation

tamargrey commented Mar 17, 2023 • edited Loading

codecov bot commented Mar 17, 2023 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eccabay left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tamargrey commented Mar 17, 2023 •

edited

Loading

codecov bot commented Mar 17, 2023 •

edited

Loading