-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Knn imputer implementation #3662
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3662 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 337 339 +2
Lines 34078 34239 +161
=======================================
+ Hits 33947 34108 +161
Misses 131 131
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
…yx/evalml into knn_imputer_implementation Git pull
…yx/evalml into knn_imputer_implementation
…yx/evalml into knn_imputer_implementation
…yx/evalml into knn_imputer_implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking pretty good! Agreed with Jeremy's comments though. I'd also love for you to take a closer look at the tests again - make sure we're testing the new behavior and the weird edge cases, but not throwing too much compute power at unnecessary tests. In general as well, I'd recommend leveraging our pytest fixtures for data creation whenever possible.
docs/source/release_notes.rst
Outdated
@@ -2,6 +2,7 @@ Release Notes | |||
------------- | |||
**Future Releases** | |||
* Enhancements | |||
* Added KNNImputer class and created new knn parameter for Imputer? :pr:`3662`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting - need the double tick marks around KNNImputer
, plus we don't need the question mark or the extra tick mark at the end.
"""Imputes missing data using KNN according to a specified number of neighbors. Natural language columns are ignored. | ||
|
||
Args: | ||
number_neighbors (int): Number of nearest neighbors for KNN to search for. Defaults to 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: defaults to 0 -> defaults to 3
def _drop_natural_language_columns(self, X): | ||
natural_language_columns = list( | ||
X.ww.select(["NaturalLanguage"], return_schema=True).columns.keys(), | ||
) | ||
if natural_language_columns: | ||
X = X.ww.copy() | ||
X = X.ww.drop(columns=natural_language_columns) | ||
return X, natural_language_columns | ||
|
||
def _set_boolean_columns_to_categorical(self, X): | ||
X_schema = X.ww.schema | ||
original_X_schema = X_schema.get_subset_schema( | ||
subset_cols=X_schema._filter_cols(exclude=["Boolean"]), | ||
) | ||
X_boolean_cols = X_schema._filter_cols(include=["Boolean"]) | ||
new_ltypes_for_boolean_cols = {col: "Categorical" for col in X_boolean_cols} | ||
X.ww.init(schema=original_X_schema, logical_types=new_ltypes_for_boolean_cols) | ||
return X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions are copied from the SimpleImputer
but don't rely on being a class method, could you move these into a utils file (both for this class and the SimpleImputer)? The higher level components/utils.py
would probably be fine.
|
||
# Drop natural language columns and transform the other columns | ||
X_t, natural_language_cols = self._drop_natural_language_columns(X) | ||
if X_t.shape[-1] == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Just a few lines higher you check X.shape[1]
while here you check X_t.shape[-1]
. Can you flip this to be consistent?
"numeric_impute_strategy": ["mean", "median", "most_frequent", "knn"], | ||
"boolean_impute_strategy": ["most_frequent", "knn"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're going to need more tests here than just checking that knn is in the hyperparameters. At a minimum, I'd test when a dtype strategy is set to knn
that the class attribute is a KNN imputer with the expected parameters.
from evalml.pipelines.components.transformers.imputers import KNNImputer | ||
|
||
|
||
def test_knn_imputer_1_neighbor(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to test this. Fundamentally, we can rely on sklearn to have implemented n_neighbors
correctly. A better thing to test here would be to check that our output matches the sklearn object's output, and we can parametrize over the number of neighbors if you're worried about that as well.
Pseudocode:
@pytest.mark.parametrize("n_neighbors", [1, 2, 5])
def test_knn_output(n_neighbors):
X = ...
sk_knn = SKKNN(n_neighbors)
knn = KNN(n_neighbors)
assert sk_knn.fit_transform(X) == knn.fit_transform(X)
assert_frame_equal(X_multi_expected_arr, X_multi_t) | ||
|
||
|
||
def test_knn_imputer_all_bool(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is almost functionally identical to test_knn_imputer_all_bool_return_original
, we don't need both
…yx/evalml into knn_imputer_implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good! Just a few small things to fix before this goes in
original_X_schema = X_schema.get_subset_schema( | ||
subset_cols=X_schema._filter_cols(exclude=["BooleanNullable"]), | ||
) | ||
X_bool_nullable_cols = X_schema._filter_cols(include=["Categorical"]) | ||
new_ltypes_for_bool_nullable_cols = { | ||
col: "Boolean" for col in X_bool_nullable_cols | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very confused by this and the lower code messing with boolean data types. What exactly what the goal here? It looks like you may be able to replace all this with a call to downcast_nullable_types
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote the code above since honestly it confused even me reading it now. downcast_nullable_types
turns type booleans into doubles which I would prefer not to have happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, downcast_nullable_types
appears to turn BooleanNullable
into boolean, if you were getting doubles that seems like a problem we should address. It's not relevant to this issue though, and the cleaned up code is a lot easier to read so I'm fine with it!
@@ -163,6 +163,45 @@ def handle_component_class(component_class): | |||
return component_class | |||
|
|||
|
|||
def drop_natural_language_columns(X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! It'd be great if you could add test cases for this function as well as set_boolean_columns_to_categorical
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work! Should be good to go after some more testing changes as well as Becca's comments.
) | ||
if boolean_impute_strategy == "knn": | ||
self._boolean_imputer = KNNImputer( | ||
number_neighbors=1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think thats a good idea - we should be able to up the number of neighbors. Can you file an issue for this?
imputer = KNNImputer() | ||
imputer.fit(X, y) | ||
X_t = imputer.transform(X) | ||
assert_frame_equal(X_expected_arr, X_t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also check the woodwork logical types are the same here.
"some_nan": pd.Series([True, np.nan, False, np.nan, True], dtype="boolean"), | ||
}, | ||
) | ||
X.ww.init(logical_types={"some_nan": "BooleanNullable"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we only init here based off if data_type == ww
? we should either init or run make_data_type
later.
imputer = KNNImputer(number_neighbors=1) | ||
X_t = imputer.fit_transform(X, y) | ||
X_ww_expected = "<Series: some_nan (Physical Type = bool) (Logical Type = Boolean) (Semantic Tags = set())>" | ||
assert str(X_t.ww["some_nan"].ww) == X_ww_expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just assert that the logical type is Boolean by indexing into X_t.ww
"bool no nan": pd.Series([False, False, False, False, True], dtype=bool), | ||
}, | ||
) | ||
X_multi.ww.init(logical_types={"bool with nan": "BooleanNullable"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question about init
and make_data_type
here.
imputer = KNNImputer(number_neighbors=1) | ||
imputer.fit(X_multi, y) | ||
X_multi_t = imputer.transform(X_multi) | ||
assert_frame_equal(X_multi_expected_arr, X_multi_t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's assert that the logical types make sense here too.
assert_frame_equal(X_multi_expected_arr, X_multi_t) | ||
|
||
|
||
def test_knn_imputer_revert_categorical_to_boolean(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this test to check if BooleanNullables
will be converted to Boolean
? If so let's parameterize this method to use pandas and ww as well and then check the logical types in the WW case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 92 was to test if I could revert categorical to boolean. However KNNImputer doesn't have the same issues that SimpleImputer has if you pass in just bools (no nulls), therefore I can get rid of line 92's test (since I don't have to actually change bool columns that are all bools into categorical).
|
||
result = imputer.transform(X_df, y) | ||
|
||
# raise Exception |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No oops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for powering through this! Let's get this in 😎
Pull Request Description
This is the KNNImputer. I've added it to the hyper-parameter range in
Imputer.py
.After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of
docs/source/release_notes.rst
to include this pull request by adding :pr:123
.