-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove other nullable handling across AutoMLSearch #4085
Remove other nullable handling across AutoMLSearch #4085
Conversation
Codecov Report
@@ Coverage Diff @@
## component-specific-nullable-types-handling #4085 +/- ##
=============================================================================
+ Coverage 84.5% 99.7% +15.3%
=============================================================================
Files 349 349
Lines 37579 37607 +28
=============================================================================
+ Hits 31748 37490 +5742
+ Misses 5831 117 -5714
... and 32 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
a020d77
to
3a1b016
Compare
9541e99
to
07b89dc
Compare
3a1b016
to
00f418b
Compare
bool_ltype = "BooleanNullable" if use_nullable else "Boolean" | ||
X.ww.set_types({0: "Categorical"}) | ||
X.ww[len(X.columns)] = ww.init_series( | ||
pd.Series([True, False] * 50), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is trying to artificially add Categorical and Boolean columns to X_y_binary
which is all float columns. The Boolean
transformation works, but BooleanNullable
does not allow converting from numeric values that aren't 1s or 0s. So I changed this test to set one of the float columns to be Categorical and add in new Boolean column of actual boolean values. Similar changes had to be made to test_smotenc_boolean_numeric
for the same reasons.
@@ -231,7 +231,7 @@ def test_data_checks_impute_cols(problem_type): | |||
expected_pipeline_class = MulticlassClassificationPipeline | |||
y_expected = ww.init_series( | |||
pd.Series([0, 1, 2, 2, 2]), | |||
logical_type="Integer", | |||
logical_type="IntegerNullable", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These testing changes are due to the target imputer changes to maintain original logical type if possible
@@ -2746,6 +2746,6 @@ def test_get_component_input_logical_types(): | |||
no_estimator.fit(X, y) | |||
assert no_estimator.last_component_input_logical_types == { | |||
"cat": Categorical(), | |||
"numeric": Double(), | |||
"numeric": Integer(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a side effect of the change to Imputer
to keep logical types where possible
"X_has_nans", | ||
[True, False], | ||
) | ||
def test_partial_dependence_with_nullable_types( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this test because mquantiles
, which we use in _grid_from_X
, has a nullable type incompatibility when null values are present in Int64
or boolean
data.
This isn't actually going to be problematic, though, because the incompatibility is only when null values are present in the data, and remove any nans prior to passing the data into mquantiles
https://github.com/alteryx/evalml/blob/main/evalml/model_understanding/_partial_dependence_utils.py#L178-L185. But I wanted to add this test in case we ever changed how we use mquantiles
.
# Numeric imputing may have changed logical types because of use of _get_new_logical_types_for_imputed_data | ||
# --> this is only needed because of test_imputer_does_not_erase_ww_info test - otherwise we always assume ww types would be present | ||
if imputed.ww.schema is None: | ||
imputed.ww.init() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference would be that imputers called by Imputer
should all handle the woodwork initialization and type changes needed according to the specific impute behavior. However, the test test_imputer_does_not_erase_ww_info
, from what I can tell, is specifically testing that if the imputers called by Imputer
don't initialize woodwork, that the Imputer
component still will and will maintain types.
So I added this woodwork init to handle that case to honor the spirit of that test. But I'd be happy to remove the initalization logic here and change or remove test_imputer_does_not_erase_ww_info
if people agree that it doesn't feel necessary to keep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The does_not_erase_ww_info
tests came from an error where some of our components would drop typing information, so we had to waste time reinferring and we potentially lost typing information. It would make sense to try and delegate all typing work to the imputer subclasses, but I think it's safer in the long run to keep ensuring typing info here as well. It'd be great if we could initialize more intelligently to avoid inference where possible, but I'm not sure if that's doable here.
A related clarifying question, we don't need to do anything with boolean columns here because they won't need to change like int->float does and we have all the boolean nullable handling we need in other components, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it - so because we're doing a ww.init()
call here, it does open up the possibility of reinferring and losing types if the simple imputer stopped maintaining the woodwork schema, but, like you say, there's not a great way to avoid this, and at least it would just be the numeric columns. The good thing is that we don't expect to actually need to init woodwork here, so we're okay leaving it as is.
And to answer your question: that is correct. We can fully maintain all the non integer types, and that allows us to not have any similar logic for any of the other types of columns.
) | ||
# --> should probably use the automl test env but im not sure if its okay to have the score return value be set? | ||
env = AutoMLTestEnv(problem_type) | ||
with env.test_context(score_return_value={automl.objective.name: 1.0}): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to double check that using the automl test env this way doesnt defeat the purpose of this test. I figured we wanted to use it otherwise these tests take forever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should be good :)
assert ( | ||
combined_pipeline.component_graph.get_inputs("First Pipeline - Imputer")[0] | ||
== "First Pipeline - Replace Nullable Types Transformer.x" | ||
== "DFS Transformer.x" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% this is the right check, but it seems like the main purpose was to confirm the order of components was correct, for which the DFS Transformer could be used now that the replace nullable types transformer is gone.
) | ||
assert order.index( | ||
f"{dtype} Pipeline - Replace Nullable Types Transformer", | ||
) < order.index(f"{dtype} Pipeline - Imputer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, this is testing that in categorical and numeric split pipelines, the replace nullable types transformer comes after the label encoder. So now that replace nullable types transformer is gone, there's no need for such a check? If the goal of the test was instead about the placement of the label encoder, I can add in a different check, but I feel like the assert order[0] == "Label Encoder" if not features else "DFS Transformer"
check above should cover that..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this was a recent patch on my end, I should have left a comment. The goal of this test is to ensure that splitting the pipeline doesn't mess up the computation order. Some components need to be computed before others, so we need to ensure that that still occurs.
The original test was asserting the exact computation order of all the components, which changed when I upgraded our networkx
version. I changed it to focus just on the "what components need to be computed before other components" part, relaxing the strict order requirement. We should still keep this test to some degree, but we can skip the Replace Nullable Types Transformer part of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, so I'll keep this as is with the assert order[0] == "Label Encoder" if not features else "DFS Transformer"
check in this test maintaining coverage of the expected order of things!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good, just a few small things!
@@ -95,23 +98,18 @@ def transform(self, X, y=None): | |||
X_t = self._component_obj.transform(X[self._cols_to_impute]) | |||
X_t = pd.DataFrame(X_t, columns=self._cols_to_impute) | |||
|
|||
# Get Woodwork types for the imputed data | |||
new_schema = original_schema.get_subset_schema(self._cols_to_impute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there a specific reason you decided to get rid of new_schema
here but keep it in the simple imputer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No specific reason. I'm now realizing how similar the knn and simple imputers' transform methods really are. Really the only difference is the bool
logic in the simple imputer to return early if needed.
I can refactor them to follow the same flow of logic and have more similar variable names so that it's easier to read through them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented in 7057482 - this uncovered an undesired behavior when using only original_schema
to reinit woodwork when fully null columns were present, so I stuck with the behavior of defining the new schema as its own variable but changed its name to be imputed_schema
to hopefully be slightly more clear than new_schema
was
# Numeric imputing may have changed logical types because of use of _get_new_logical_types_for_imputed_data | ||
# --> this is only needed because of test_imputer_does_not_erase_ww_info test - otherwise we always assume ww types would be present | ||
if imputed.ww.schema is None: | ||
imputed.ww.init() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The does_not_erase_ww_info
tests came from an error where some of our components would drop typing information, so we had to waste time reinferring and we potentially lost typing information. It would make sense to try and delegate all typing work to the imputer subclasses, but I think it's safer in the long run to keep ensuring typing info here as well. It'd be great if we could initialize more intelligently to avoid inference where possible, but I'm not sure if that's doable here.
A related clarifying question, we don't need to do anything with boolean columns here because they won't need to change like int->float does and we have all the boolean nullable handling we need in other components, right?
from evalml.utils.nullable_type_utils import _determine_fractional_type | ||
|
||
|
||
def _get_new_logical_types_for_imputed_data( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like pulling all the typing into a single function here, all the imputers are so much easier to read now.
It feels a little odd to add a whole new file just for one function. I really wanted to be able to add this as a static method to a parent imputer class that they all inherit from, but we don't have that set up at this point. I also like to avoid calling private functions outside of the file where they were defined. What are your thoughts on moving this to nullable_types_utils
? This function seems like it fits under the nullable util umbrella.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's originally where I had it--I can never decide where to put things! I'll happily move it back there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done - af45447
) | ||
assert order.index( | ||
f"{dtype} Pipeline - Replace Nullable Types Transformer", | ||
) < order.index(f"{dtype} Pipeline - Imputer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this was a recent patch on my end, I should have left a comment. The goal of this test is to ensure that splitting the pipeline doesn't mess up the computation order. Some components need to be computed before others, so we need to ensure that that still occurs.
The original test was asserting the exact computation order of all the components, which changed when I upgraded our networkx
version. I changed it to focus just on the "what components need to be computed before other components" part, relaxing the strict order requirement. We should still keep this test to some degree, but we can skip the Replace Nullable Types Transformer part of it.
assert all([ReplaceNullableTypes.name in pl for pl in pipelines]) | ||
|
||
# Confirm ReplaceNullableTypes transformer isn't used | ||
assert not any([ReplaceNullableTypes.name in pl for pl in pipelines]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
07b89dc
to
63f12da
Compare
…lities in methods
…le_nullable_types instead
0ee6b8f
to
8599f7a
Compare
@@ -603,6 +607,8 @@ def test_imputer_int_preserved(): | |||
0: Double, | |||
} | |||
|
|||
# If no null values need to be imputed, integer column should be maintained |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate these comments - good practice.
@@ -34,7 +34,7 @@ def test_knn_imputer_ignores_natural_language( | |||
imputer_test_data, | |||
df_composition, | |||
): | |||
"""Test to ensure that the simple imputer just passes through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can scold @MichaelFu512 for this when he comes back :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol I have no feet to stand on -- I did the exact same thing and @eccabay had to correct me!
) | ||
# --> should probably use the automl test env but im not sure if its okay to have the score return value be set? | ||
env = AutoMLTestEnv(problem_type) | ||
with env.test_context(score_return_value={automl.objective.name: 1.0}): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should be good :)
* Update tests * remove unnecessary comments * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * Handle new oversampler nullable type incompatibility and add tests * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add comments to self * Add handle call for lgbm classifier * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * remove nullable type handling after sklearn upgrade fixed incompatibilities * Remove downcast call from imputer.py and fix tests * remove downcast call from catboost regressor * stop adding ohe if bool null present * remove nullable type handling from knn imputer * remove nullable type handling from targer imputer * use util for ltype deciding in simple imputer * Fix broken target imputer test * Remove replace nullable types transformer from automl search and fix tests * Handle nullable types in lgbm classifier predict_proba * fix after rebase * move util to imputer utils and other fixes * remove duplicate util * Change how imputer reinits woodwork types * Expand tests * use automl test env * fix broken test * Clean up * Add release note * Further clean up and note incompatibility * Continue clean up * fix failing docstring * remove remaining comments * fix invali target data check docstring * let knn and simple imputers use same flow of logic for readability * Move _get_new_logical_types_for_imputed_data to nullable type utils file * PR comment
* Update tests * remove unnecessary comments * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * Handle new oversampler nullable type incompatibility and add tests * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add comments to self * Add handle call for lgbm classifier * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * remove nullable type handling after sklearn upgrade fixed incompatibilities * Remove downcast call from imputer.py and fix tests * remove downcast call from catboost regressor * stop adding ohe if bool null present * remove nullable type handling from knn imputer * remove nullable type handling from targer imputer * use util for ltype deciding in simple imputer * Fix broken target imputer test * Remove replace nullable types transformer from automl search and fix tests * Handle nullable types in lgbm classifier predict_proba * fix after rebase * move util to imputer utils and other fixes * remove duplicate util * Change how imputer reinits woodwork types * Expand tests * use automl test env * fix broken test * Clean up * Add release note * Further clean up and note incompatibility * Continue clean up * fix failing docstring * remove remaining comments * fix invali target data check docstring * let knn and simple imputers use same flow of logic for readability * Move _get_new_logical_types_for_imputed_data to nullable type utils file * PR comment
* initial commit to make the PR * Refactor imputer components to remove unnecessary logic (#4038) * Stop using woodwork describe to get nan info in time series imputer * remove logic that's no longer needed and return if all bool dtype * remove unnecessary logic from target imputer * remove unused utils * remove logic to convert dfs features to categorical logical type * fix email featureizer test * Revert changes to transfomr prim components for testing * Revert "Revert changes to transfomr prim components for testing" This reverts commit 57dda43. * Fix bugs with ww not imputed and string categories * Add release note * Handle case where nans are present at transform in previously all bool dtype data * Stop truncating ints in target imputer * clean up * Fix tests * Keep ltype integer for most frequent impute tpye in target imputer * refactor knn imputer to use new logic * Fix list bug * remove comment * Update release note to mention nullable types * remove outdated comment * Conver all bool dfs to boolean nullable instead of refitting * lint fix * PR comments * add second bool col to imputer fixture * Use nullable type handling in components' fit, transform, and predict methods (#4046) * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add handle call for lgbm regressor and remove existing handling * Add handle call for lgbm classifier * temp broken exp smoothing tests * lint fix * add release note * Fix broken tests by initting woodwork on y in lgbm classifier * Update tests * Call handle in arima * call handle from ts imputer y ltype is downcasted value * remove unnecessary comments * Fix time series guide * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Add y ww init to ts imputer back in to fix tests * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * use X_d in lgbm predict proba * remove nullable type handling after sklearn upgrade fixed incompatibilities * use common util to determine type for time series imputed integers * Add comments around why we copy X * remove _prepare_data from samplers * PR comments * remove tests to check if handle method is called * remove nullable types from imputed data because of regularizer * fix typo * fix docstrings * fix codecov issues * PR comments * Revert "Fix time series guide" This reverts commit 964622a. * return unchanged ltype in nullabl;e type utils * add back ts imputer incompatibility test * use dict get return value * call handle nullable types in oversampler and check schema equality * Remove other nullable handling across AutoMLSearch (#4085) * Update tests * remove unnecessary comments * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * Handle new oversampler nullable type incompatibility and add tests * Remove existing nullable type handling from oversampler and use _handle_nullable_types instead * Add comments to self * Add handle call for lgbm classifier * lint fix * Only call handle_nullable_types when necessary in methods * Remove remaining unnecessary handle calls * resolve remaining comments * Copy X in testing nullable types to stop hiding potential incompatibilities in methods * remove nullable type handling after sklearn upgrade fixed incompatibilities * Remove downcast call from imputer.py and fix tests * remove downcast call from catboost regressor * stop adding ohe if bool null present * remove nullable type handling from knn imputer * remove nullable type handling from targer imputer * use util for ltype deciding in simple imputer * Fix broken target imputer test * Remove replace nullable types transformer from automl search and fix tests * Handle nullable types in lgbm classifier predict_proba * fix after rebase * move util to imputer utils and other fixes * remove duplicate util * Change how imputer reinits woodwork types * Expand tests * use automl test env * fix broken test * Clean up * Add release note * Further clean up and note incompatibility * Continue clean up * fix failing docstring * remove remaining comments * fix invali target data check docstring * let knn and simple imputers use same flow of logic for readability * Move _get_new_logical_types_for_imputed_data to nullable type utils file * PR comment * Handle decomposer incompatibility (#4105) * Handle decomposer incompatibility * fix comment * Stop scaling the values up * Add release note * PR Comments * Create incompatibility for testing more similarly to how it shows up in the decomposer * Add release note * lint fix * pr notes * fix after rebase --------- Co-authored-by: chukarsten <64713315+chukarsten@users.noreply.github.com>
closes #3994, closes #3999, closes #4000