Knn imputer implementation#3662
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3662 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 337 339 +2
Lines 34078 34239 +161
=======================================
+ Hits 33947 34108 +161
Misses 131 131
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
…yx/evalml into knn_imputer_implementation Git pull
…yx/evalml into knn_imputer_implementation
…yx/evalml into knn_imputer_implementation
…yx/evalml into knn_imputer_implementation
eccabay
left a comment
There was a problem hiding this comment.
Looking pretty good! Agreed with Jeremy's comments though. I'd also love for you to take a closer look at the tests again - make sure we're testing the new behavior and the weird edge cases, but not throwing too much compute power at unnecessary tests. In general as well, I'd recommend leveraging our pytest fixtures for data creation whenever possible.
| ------------- | ||
| **Future Releases** | ||
| * Enhancements | ||
| * Added KNNImputer class and created new knn parameter for Imputer? :pr:`3662`` |
There was a problem hiding this comment.
Formatting - need the double tick marks around KNNImputer, plus we don't need the question mark or the extra tick mark at the end.
| """Imputes missing data using KNN according to a specified number of neighbors. Natural language columns are ignored. | ||
|
|
||
| Args: | ||
| number_neighbors (int): Number of nearest neighbors for KNN to search for. Defaults to 0. |
There was a problem hiding this comment.
Typo: defaults to 0 -> defaults to 3
| def _drop_natural_language_columns(self, X): | ||
| natural_language_columns = list( | ||
| X.ww.select(["NaturalLanguage"], return_schema=True).columns.keys(), | ||
| ) | ||
| if natural_language_columns: | ||
| X = X.ww.copy() | ||
| X = X.ww.drop(columns=natural_language_columns) | ||
| return X, natural_language_columns | ||
|
|
||
| def _set_boolean_columns_to_categorical(self, X): | ||
| X_schema = X.ww.schema | ||
| original_X_schema = X_schema.get_subset_schema( | ||
| subset_cols=X_schema._filter_cols(exclude=["Boolean"]), | ||
| ) | ||
| X_boolean_cols = X_schema._filter_cols(include=["Boolean"]) | ||
| new_ltypes_for_boolean_cols = {col: "Categorical" for col in X_boolean_cols} | ||
| X.ww.init(schema=original_X_schema, logical_types=new_ltypes_for_boolean_cols) | ||
| return X |
There was a problem hiding this comment.
These functions are copied from the SimpleImputer but don't rely on being a class method, could you move these into a utils file (both for this class and the SimpleImputer)? The higher level components/utils.py would probably be fine.
|
|
||
| # Drop natural language columns and transform the other columns | ||
| X_t, natural_language_cols = self._drop_natural_language_columns(X) | ||
| if X_t.shape[-1] == 0: |
There was a problem hiding this comment.
Nit: Just a few lines higher you check X.shape[1] while here you check X_t.shape[-1]. Can you flip this to be consistent?
| "numeric_impute_strategy": ["mean", "median", "most_frequent", "knn"], | ||
| "boolean_impute_strategy": ["most_frequent", "knn"], |
There was a problem hiding this comment.
We're going to need more tests here than just checking that knn is in the hyperparameters. At a minimum, I'd test when a dtype strategy is set to knn that the class attribute is a KNN imputer with the expected parameters.
| from evalml.pipelines.components.transformers.imputers import KNNImputer | ||
|
|
||
|
|
||
| def test_knn_imputer_1_neighbor(): |
There was a problem hiding this comment.
We don't need to test this. Fundamentally, we can rely on sklearn to have implemented n_neighbors correctly. A better thing to test here would be to check that our output matches the sklearn object's output, and we can parametrize over the number of neighbors if you're worried about that as well.
Pseudocode:
@pytest.mark.parametrize("n_neighbors", [1, 2, 5])
def test_knn_output(n_neighbors):
X = ...
sk_knn = SKKNN(n_neighbors)
knn = KNN(n_neighbors)
assert sk_knn.fit_transform(X) == knn.fit_transform(X)
| assert_frame_equal(X_multi_expected_arr, X_multi_t) | ||
|
|
||
|
|
||
| def test_knn_imputer_all_bool(): |
There was a problem hiding this comment.
This test is almost functionally identical to test_knn_imputer_all_bool_return_original, we don't need both
…yx/evalml into knn_imputer_implementation
eccabay
left a comment
There was a problem hiding this comment.
Looking really good! Just a few small things to fix before this goes in
| original_X_schema = X_schema.get_subset_schema( | ||
| subset_cols=X_schema._filter_cols(exclude=["BooleanNullable"]), | ||
| ) | ||
| X_bool_nullable_cols = X_schema._filter_cols(include=["Categorical"]) | ||
| new_ltypes_for_bool_nullable_cols = { | ||
| col: "Boolean" for col in X_bool_nullable_cols | ||
| } |
There was a problem hiding this comment.
I'm very confused by this and the lower code messing with boolean data types. What exactly what the goal here? It looks like you may be able to replace all this with a call to downcast_nullable_types.
There was a problem hiding this comment.
I rewrote the code above since honestly it confused even me reading it now. downcast_nullable_types turns type booleans into doubles which I would prefer not to have happen.
There was a problem hiding this comment.
Hm, downcast_nullable_types appears to turn BooleanNullable into boolean, if you were getting doubles that seems like a problem we should address. It's not relevant to this issue though, and the cleaned up code is a lot easier to read so I'm fine with it!
| return component_class | ||
|
|
||
|
|
||
| def drop_natural_language_columns(X): |
There was a problem hiding this comment.
Thanks for doing this! It'd be great if you could add test cases for this function as well as set_boolean_columns_to_categorical.
jeremyliweishih
left a comment
There was a problem hiding this comment.
Good work! Should be good to go after some more testing changes as well as Becca's comments.
| ) | ||
| if boolean_impute_strategy == "knn": | ||
| self._boolean_imputer = KNNImputer( | ||
| number_neighbors=1, |
There was a problem hiding this comment.
I think thats a good idea - we should be able to up the number of neighbors. Can you file an issue for this?
| imputer = KNNImputer() | ||
| imputer.fit(X, y) | ||
| X_t = imputer.transform(X) | ||
| assert_frame_equal(X_expected_arr, X_t) |
There was a problem hiding this comment.
let's also check the woodwork logical types are the same here.
| "some_nan": pd.Series([True, np.nan, False, np.nan, True], dtype="boolean"), | ||
| }, | ||
| ) | ||
| X.ww.init(logical_types={"some_nan": "BooleanNullable"}) |
There was a problem hiding this comment.
should we only init here based off if data_type == ww? we should either init or run make_data_type later.
| imputer = KNNImputer(number_neighbors=1) | ||
| X_t = imputer.fit_transform(X, y) | ||
| X_ww_expected = "<Series: some_nan (Physical Type = bool) (Logical Type = Boolean) (Semantic Tags = set())>" | ||
| assert str(X_t.ww["some_nan"].ww) == X_ww_expected |
There was a problem hiding this comment.
you can just assert that the logical type is Boolean by indexing into X_t.ww
| "bool no nan": pd.Series([False, False, False, False, True], dtype=bool), | ||
| }, | ||
| ) | ||
| X_multi.ww.init(logical_types={"bool with nan": "BooleanNullable"}) |
There was a problem hiding this comment.
same question about init and make_data_type here.
| imputer = KNNImputer(number_neighbors=1) | ||
| imputer.fit(X_multi, y) | ||
| X_multi_t = imputer.transform(X_multi) | ||
| assert_frame_equal(X_multi_expected_arr, X_multi_t) |
There was a problem hiding this comment.
let's assert that the logical types make sense here too.
| assert_frame_equal(X_multi_expected_arr, X_multi_t) | ||
|
|
||
|
|
||
| def test_knn_imputer_revert_categorical_to_boolean(): |
There was a problem hiding this comment.
is this test to check if BooleanNullables will be converted to Boolean? If so let's parameterize this method to use pandas and ww as well and then check the logical types in the WW case.
There was a problem hiding this comment.
Line 92 was to test if I could revert categorical to boolean. However KNNImputer doesn't have the same issues that SimpleImputer has if you pass in just bools (no nulls), therefore I can get rid of line 92's test (since I don't have to actually change bool columns that are all bools into categorical).
|
|
||
| result = imputer.transform(X_df, y) | ||
|
|
||
| # raise Exception |
There was a problem hiding this comment.
is this comment necessary?
eccabay
left a comment
There was a problem hiding this comment.
Awesome, thanks for powering through this! Let's get this in 😎
Pull Request Description
This is the KNNImputer. I've added it to the hyper-parameter range in
Imputer.py.After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of
docs/source/release_notes.rstto include this pull request by adding :pr:123.