Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Knn imputer implementation #3662

Merged
merged 65 commits into from
Sep 2, 2022
Merged

Knn imputer implementation #3662

merged 65 commits into from
Sep 2, 2022

Conversation

MichaelFu512
Copy link
Contributor

@MichaelFu512 MichaelFu512 commented Aug 12, 2022

Pull Request Description

This is the KNNImputer. I've added it to the hyper-parameter range in Imputer.py.

After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@codecov
Copy link

codecov bot commented Aug 12, 2022

Codecov Report

Merging #3662 (08dbad6) into main (49090e5) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3662     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        337     339      +2     
  Lines      34078   34239    +161     
=======================================
+ Hits       33947   34108    +161     
  Misses       131     131             
Impacted Files Coverage Δ
evalml/tests/automl_tests/test_automl.py 99.5% <ø> (ø)
...ts/automl_tests/test_automl_iterative_algorithm.py 100.0% <ø> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 99.8% <ø> (ø)
...lines/components/transformers/imputers/__init__.py 100.0% <100.0%> (ø)
...elines/components/transformers/imputers/imputer.py 100.0% <100.0%> (ø)
...es/components/transformers/imputers/knn_imputer.py 100.0% <100.0%> (ø)
...components/transformers/imputers/simple_imputer.py 100.0% <100.0%> (ø)
evalml/pipelines/components/utils.py 95.8% <100.0%> (+0.5%) ⬆️
evalml/tests/component_tests/test_imputer.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_knn_imputer.py 100.0% <100.0%> (ø)
... and 2 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good! Agreed with Jeremy's comments though. I'd also love for you to take a closer look at the tests again - make sure we're testing the new behavior and the weird edge cases, but not throwing too much compute power at unnecessary tests. In general as well, I'd recommend leveraging our pytest fixtures for data creation whenever possible.

@@ -2,6 +2,7 @@ Release Notes
-------------
**Future Releases**
* Enhancements
* Added KNNImputer class and created new knn parameter for Imputer? :pr:`3662``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting - need the double tick marks around KNNImputer, plus we don't need the question mark or the extra tick mark at the end.

"""Imputes missing data using KNN according to a specified number of neighbors. Natural language columns are ignored.

Args:
number_neighbors (int): Number of nearest neighbors for KNN to search for. Defaults to 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: defaults to 0 -> defaults to 3

Comment on lines 39 to 56
def _drop_natural_language_columns(self, X):
natural_language_columns = list(
X.ww.select(["NaturalLanguage"], return_schema=True).columns.keys(),
)
if natural_language_columns:
X = X.ww.copy()
X = X.ww.drop(columns=natural_language_columns)
return X, natural_language_columns

def _set_boolean_columns_to_categorical(self, X):
X_schema = X.ww.schema
original_X_schema = X_schema.get_subset_schema(
subset_cols=X_schema._filter_cols(exclude=["Boolean"]),
)
X_boolean_cols = X_schema._filter_cols(include=["Boolean"])
new_ltypes_for_boolean_cols = {col: "Categorical" for col in X_boolean_cols}
X.ww.init(schema=original_X_schema, logical_types=new_ltypes_for_boolean_cols)
return X
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are copied from the SimpleImputer but don't rely on being a class method, could you move these into a utils file (both for this class and the SimpleImputer)? The higher level components/utils.py would probably be fine.


# Drop natural language columns and transform the other columns
X_t, natural_language_cols = self._drop_natural_language_columns(X)
if X_t.shape[-1] == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Just a few lines higher you check X.shape[1] while here you check X_t.shape[-1]. Can you flip this to be consistent?

Comment on lines +72 to +73
"numeric_impute_strategy": ["mean", "median", "most_frequent", "knn"],
"boolean_impute_strategy": ["most_frequent", "knn"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to need more tests here than just checking that knn is in the hyperparameters. At a minimum, I'd test when a dtype strategy is set to knn that the class attribute is a KNN imputer with the expected parameters.

from evalml.pipelines.components.transformers.imputers import KNNImputer


def test_knn_imputer_1_neighbor():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to test this. Fundamentally, we can rely on sklearn to have implemented n_neighbors correctly. A better thing to test here would be to check that our output matches the sklearn object's output, and we can parametrize over the number of neighbors if you're worried about that as well.
Pseudocode:

@pytest.mark.parametrize("n_neighbors", [1, 2, 5])
def test_knn_output(n_neighbors):
   X = ...
   sk_knn = SKKNN(n_neighbors)
   knn = KNN(n_neighbors)
   assert sk_knn.fit_transform(X) == knn.fit_transform(X)

assert_frame_equal(X_multi_expected_arr, X_multi_t)


def test_knn_imputer_all_bool():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is almost functionally identical to test_knn_imputer_all_bool_return_original, we don't need both

Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good! Just a few small things to fix before this goes in

Comment on lines 93 to 99
original_X_schema = X_schema.get_subset_schema(
subset_cols=X_schema._filter_cols(exclude=["BooleanNullable"]),
)
X_bool_nullable_cols = X_schema._filter_cols(include=["Categorical"])
new_ltypes_for_bool_nullable_cols = {
col: "Boolean" for col in X_bool_nullable_cols
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very confused by this and the lower code messing with boolean data types. What exactly what the goal here? It looks like you may be able to replace all this with a call to downcast_nullable_types.

Copy link
Contributor Author

@MichaelFu512 MichaelFu512 Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote the code above since honestly it confused even me reading it now. downcast_nullable_types turns type booleans into doubles which I would prefer not to have happen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, downcast_nullable_types appears to turn BooleanNullable into boolean, if you were getting doubles that seems like a problem we should address. It's not relevant to this issue though, and the cleaned up code is a lot easier to read so I'm fine with it!

@@ -163,6 +163,45 @@ def handle_component_class(component_class):
return component_class


def drop_natural_language_columns(X):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! It'd be great if you could add test cases for this function as well as set_boolean_columns_to_categorical.

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Should be good to go after some more testing changes as well as Becca's comments.

)
if boolean_impute_strategy == "knn":
self._boolean_imputer = KNNImputer(
number_neighbors=1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think thats a good idea - we should be able to up the number of neighbors. Can you file an issue for this?

imputer = KNNImputer()
imputer.fit(X, y)
X_t = imputer.transform(X)
assert_frame_equal(X_expected_arr, X_t)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also check the woodwork logical types are the same here.

"some_nan": pd.Series([True, np.nan, False, np.nan, True], dtype="boolean"),
},
)
X.ww.init(logical_types={"some_nan": "BooleanNullable"})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we only init here based off if data_type == ww? we should either init or run make_data_type later.

imputer = KNNImputer(number_neighbors=1)
X_t = imputer.fit_transform(X, y)
X_ww_expected = "<Series: some_nan (Physical Type = bool) (Logical Type = Boolean) (Semantic Tags = set())>"
assert str(X_t.ww["some_nan"].ww) == X_ww_expected
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just assert that the logical type is Boolean by indexing into X_t.ww

"bool no nan": pd.Series([False, False, False, False, True], dtype=bool),
},
)
X_multi.ww.init(logical_types={"bool with nan": "BooleanNullable"})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question about init and make_data_type here.

imputer = KNNImputer(number_neighbors=1)
imputer.fit(X_multi, y)
X_multi_t = imputer.transform(X_multi)
assert_frame_equal(X_multi_expected_arr, X_multi_t)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's assert that the logical types make sense here too.

assert_frame_equal(X_multi_expected_arr, X_multi_t)


def test_knn_imputer_revert_categorical_to_boolean():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this test to check if BooleanNullables will be converted to Boolean? If so let's parameterize this method to use pandas and ww as well and then check the logical types in the WW case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 92 was to test if I could revert categorical to boolean. However KNNImputer doesn't have the same issues that SimpleImputer has if you pass in just bools (no nulls), therefore I can get rid of line 92's test (since I don't have to actually change bool columns that are all bools into categorical).


result = imputer.transform(X_df, y)

# raise Exception
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No oops

@MichaelFu512 MichaelFu512 requested a review from eccabay September 2, 2022 16:21
Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for powering through this! Let's get this in 😎

@MichaelFu512 MichaelFu512 merged commit de675e9 into main Sep 2, 2022
@MichaelFu512 MichaelFu512 deleted the knn_imputer_implementation branch September 2, 2022 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants