-
Notifications
You must be signed in to change notification settings - Fork 91
Remove all nan to double conversion in infer_feature_types #3196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -198,6 +198,7 @@ def test_fit_transform_drop_all_nan_columns(): | |
| "another_col": [0, 1, 2], | ||
| } | ||
| ) | ||
| X.ww.init(logical_types={"all_nan": "Double"}) | ||
| strategies = { | ||
| "all_nan": {"impute_strategy": "most_frequent"}, | ||
| "some_nan": {"impute_strategy": "most_frequent"}, | ||
|
|
@@ -212,7 +213,7 @@ def test_fit_transform_drop_all_nan_columns(): | |
| pd.DataFrame( | ||
| { | ||
| "all_nan": [np.nan, np.nan, np.nan], | ||
| "some_nan": [np.nan, 1, 0], | ||
| "some_nan": [0.0, 1.0, 0.0], | ||
| "another_col": [0, 1, 2], | ||
| } | ||
| ), | ||
|
|
@@ -227,6 +228,7 @@ def test_transform_drop_all_nan_columns(): | |
| "another_col": [0, 1, 2], | ||
| } | ||
| ) | ||
| X.ww.init(logical_types={"all_nan": "Double"}) | ||
| strategies = { | ||
| "all_nan": {"impute_strategy": "most_frequent"}, | ||
| "some_nan": {"impute_strategy": "most_frequent"}, | ||
|
|
@@ -243,7 +245,7 @@ def test_transform_drop_all_nan_columns(): | |
| pd.DataFrame( | ||
| { | ||
| "all_nan": [np.nan, np.nan, np.nan], | ||
| "some_nan": [np.nan, 1, 0], | ||
| "some_nan": [0.0, 1.0, 0.0], | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same here
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same comment as above ^ |
||
| "another_col": [0, 1, 2], | ||
| } | ||
| ), | ||
|
|
@@ -255,6 +257,7 @@ def test_transform_drop_all_nan_columns_empty(): | |
| strategies = { | ||
| "0": {"impute_strategy": "most_frequent"}, | ||
| } | ||
| X.ww.init(logical_types={0: "Double", 1: "Double", 2: "Double"}) | ||
| transformer = PerColumnImputer(impute_strategies=strategies) | ||
| assert transformer.fit_transform(X).empty | ||
| assert_frame_equal(X, pd.DataFrame([[np.nan, np.nan, np.nan]])) | ||
|
|
@@ -335,6 +338,9 @@ def test_per_column_imputer_impute_all_is_false(): | |
| "column_with_nan_included": "double", | ||
| } | ||
| ) | ||
| X.ww.init( | ||
| logical_types={"all_nan_included": "Double", "all_nan_not_included": "Double"} | ||
| ) | ||
| X_t = transformer.fit_transform(X) | ||
| assert_frame_equal(X_expected, X_t) | ||
| assert_frame_equal( | ||
|
|
@@ -344,7 +350,8 @@ def test_per_column_imputer_impute_all_is_false(): | |
| "all_nan_not_included": [np.nan, np.nan, np.nan], | ||
| "all_nan_included": [np.nan, np.nan, np.nan], | ||
| "column_with_nan_not_included": [np.nan, 1, 0], | ||
| "column_with_nan_included": [0, 1, np.nan], | ||
| # Because of https://github.com/alteryx/evalml/issues/2055 | ||
| "column_with_nan_included": [0.0, 1.0, 0.0], | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. rename |
||
| } | ||
| ), | ||
| ) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,7 +14,6 @@ | |
| from evalml.pipelines.components import ( | ||
| DateTimeFeaturizer, | ||
| DropColumns, | ||
| DropNullColumns, | ||
| DropRowsTransformer, | ||
| EmailFeaturizer, | ||
| Estimator, | ||
|
|
@@ -131,7 +130,7 @@ def test_make_pipeline( | |
| if estimator_class.model_family == ModelFamily.LINEAR_MODEL | ||
| else [] | ||
| ) | ||
| drop_null = [DropNullColumns] if "all_null" in column_names else [] | ||
| drop_null = [DropColumns] if "all_null" in column_names else [] | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change is because all_null will be considered an unknown column now, rather than double right?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes! |
||
| replace_null = ( | ||
| [ReplaceNullableTypes] | ||
| if ( | ||
|
|
@@ -147,7 +146,7 @@ def test_make_pipeline( | |
| ) | ||
| email_featurizer = [EmailFeaturizer] if "email" in column_names else [] | ||
| url_featurizer = [URLFeaturizer] if "url" in column_names else [] | ||
| imputer = [] if (column_names in [["ip"]]) else [Imputer] | ||
| imputer = [] if (column_names in [["ip"], ["all_null"]]) else [Imputer] | ||
|
|
||
| if is_time_series(problem_type): | ||
| expected_components = ( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did this need to change? Also, we should rename this column since there are no longer nans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I thought I had answered this yesterday.
The only way to pass an all-NaN column to the imputer is if it's not typed as
Unknown, which is what the default woodwork inference returns. So we need to init woodwork on the dataframe. However, because of #2055, the imputer will modify the input data.So the
some_nancolumn will have the NaNs imputed. We can't change the name of the column because the intention is to compare the input data before and after running through the imputer and the column name doesn't change after going through the imputer.