Conversation
e1ea71a to
5be5dca
Compare
Codecov Report
@@ Coverage Diff @@
## main #2783 +/- ##
=====================================
Coverage 99.8% 99.8%
=====================================
Files 297 297
Lines 27718 27718
=====================================
Hits 27650 27650
Misses 68 68
Continue to review full report at Codecov.
|
7a7dbc0 to
0bec5c2
Compare
evalml/pipelines/utils.py
Outdated
| categorical_cols = list( | ||
| X.ww.select(["category", "URL", "EmailAddress"], return_schema=True).columns | ||
| X.ww.select( | ||
| ["category", "URL", "EmailAddress", "IPAddress"], return_schema=True |
There was a problem hiding this comment.
@chukarsten How come we need to make this change? AutoMLSearch and our pipelines can't handle the IPAddress logical type yet.
9cf7ea4 to
8edb3e6
Compare
| col: "Categorical" | ||
| for col, ltype in X.ww.logical_types.items() | ||
| if isinstance(ltype, NaturalLanguage) | ||
| if isinstance(ltype, (NaturalLanguage, EmailAddress, IPAddress, URL)) |
There was a problem hiding this comment.
So the difference between WW 0.7.x and 0.8.0 is now that IPAddresses and URLs are inferred automatically. WW 0.5.1 introduced Email inference. Since these are ultimately strings, I wanted to add some support in the SimpleImputer for these types. I think they should probably be treated as categorical columns and thus able to be filled with a placeholder value.
| @@ -130,7 +131,9 @@ def _get_preprocessing_components( | |||
|
|
|||
| # The URL and EmailAddress Featurizers will create categorical columns | |||
| categorical_cols = list( | |||
There was a problem hiding this comment.
Just updating this to accommodate the IPAddress.
| @@ -350,11 +350,24 @@ def test_simple_imputer_with_none(): | |||
| assert_frame_equal(expected, transformed, check_dtype=False) | |||
|
|
|||
|
|
|||
There was a problem hiding this comment.
Modifying this test to support IPAddress, EmailAddress and URLs.
| @@ -220,7 +229,7 @@ def test_make_pipeline( | |||
| ) | |||
| drop_col = ( | |||
| [DropColumns] | |||
There was a problem hiding this comment.
This is because URLs are now properly being inferred so pandas dataframes with them are inferred properly.
| @@ -209,9 +220,7 @@ def test_make_pipeline( | |||
| else [] | |||
| ) | |||
| email_featurizer = [EmailFeaturizer] if "email" in column_names else [] | |||
There was a problem hiding this comment.
This didn't seem to be necessary.
| @@ -177,10 +189,9 @@ def test_make_pipeline( | |||
| [OneHotEncoder] | |||
| if estimator_class.model_family != ModelFamily.CATBOOST | |||
| and ( | |||
There was a problem hiding this comment.
This wasn't necessary but combined it for simplicity.
| @@ -83,6 +83,16 @@ def _get_test_data_from_configuration( | |||
| "https://github.com/alteryx/featuretools", | |||
| ] | |||
| * 2, | |||
There was a problem hiding this comment.
I'm just going to leave these as they don't hurt.
freddyaboulton
left a comment
There was a problem hiding this comment.
@chukarsten Looks good! We can fix build_conda_pkg by upping the max ww version after you merge! Thank you!!
5336256 to
bc98e08
Compare
Addresses: #2772