Conversation
b641db8 to
0d98715
Compare
Codecov Report
@@ Coverage Diff @@
## main #2610 +/- ##
=======================================
+ Coverage 99.9% 99.9% +0.1%
=======================================
Files 297 297
Lines 27033 27071 +38
=======================================
+ Hits 26989 27027 +38
Misses 44 44
Continue to review full report at Codecov.
|
thehomebrewnerd
left a comment
There was a problem hiding this comment.
Was a bit curious about the updates required for WW so took a quick look at this. Noticed a couple things that I thought I'd mention. Feel free to ignore or act upon as you see fit.
e18f41b to
004f752
Compare
| if y is None: | ||
| return self | ||
| y = infer_feature_types(y).to_frame() | ||
| y = infer_feature_types(y) |
There was a problem hiding this comment.
Moved the exception for the target imputer to fit.
There was a problem hiding this comment.
nit, but couldn't you do
y = infer_feature_types(y).to_frame()
if all(y.isnull()):
raise TypeError("Provided target full of nulls.")just to shorten/simplify slightly?
|
|
||
| def predict(self, X): | ||
| series = pd.Series() | ||
| series = pd.Series(dtype="string") |
There was a problem hiding this comment.
I think this change was to accommodate the way empty series are now inferred. Woodwork complains if you don't do this.
|
|
||
| def test_lsa_empty_text_column(): | ||
| X = pd.DataFrame({"col_1": []}) | ||
| @pytest.mark.parametrize( |
There was a problem hiding this comment.
Tried to expand the coverage here to cover other commonly "empty" columns. Also, the TextFeaturizer and LSA component have code that seems to expect tolerance of an empty column.
e.g.
def fit(self, X, y=None):
X = infer_feature_types(X)
self._text_columns = self._get_text_columns(X)
if len(self._text_columns) == 0:
return self
I thought this was strange that we would pass through an empty column and expect SKlearn to return the original ValueError, but also have code here that seemingly accounts for the behavior of what LSA (and TextFeaturizer) should do upon receiving an empty column. I wonder if, perhaps, the original intent was to account for two distincy cases of 1.) empty columns whose type is known as a string or natural language and 2.) an empty column whose type is unknown. Be happy to hear additional input here.
There was a problem hiding this comment.
Yeah I think the purpose of X = infer_feature_types(X, {"col_1": "NaturalLanguage"}) was to verify that even an empty column whose type is NaturalLanguage will be identified by self._get_text_columns(X) and will not return self immediately. And since transform looks for that as well
if len(self._text_columns) == 0:
return X_ww
sklearn would be called to transform the features and raise an error. If X = infer_feature_types(X, {"col_1": "NaturalLanguage"}) is removed, no text features are recognized and self is returned immediately.
Can't speak to the original reasoning but it looks like those are the cases being presented.
| "abalone_2@abalone.com", | ||
| "$titanic_data%&@hotmail.com", | ||
| "foo*EMAIL@email.org", | ||
| "titanic_data@hotmail.com", |
There was a problem hiding this comment.
I submitted this issue to Woodwork to cover these email addresses which slipped through the WW EmailAddress inference. @davesque since I saw you did the Email inference.
There was a problem hiding this comment.
@chukarsten Yeah, never seen email addresses like that before :). I think it's safe to delete them from test data to accommodate the woodwork update.
There was a problem hiding this comment.
Agreed. I think alteryx/woodwork#1080 will help cover against users passing impossible email values by manually specifying the email type.
| if "email" in column_names and input_type == "ww" | ||
| else [] | ||
| ) | ||
| email_featurizer = [EmailFeaturizer] if "email" in column_names else [] |
There was a problem hiding this comment.
This is the change required for Email inference in WW.
angela97lin
left a comment
There was a problem hiding this comment.
I think this is good to go, thanks Karsten! Agreed w/ Freddy on the double DropColumn, I think we're set on AutoML side but it's maybe worth discussing outside this context why we have AutoML do this in the first place :P
bchen1116
left a comment
There was a problem hiding this comment.
Looking good! I left a few questions/nits, but agreed with @freddyaboulton that we should try not to append 2 DropColumn components to the pipeline, especially since it's likely we only set 1 of them.
| if y is None: | ||
| return self | ||
| y = infer_feature_types(y).to_frame() | ||
| y = infer_feature_types(y) |
There was a problem hiding this comment.
nit, but couldn't you do
y = infer_feature_types(y).to_frame()
if all(y.isnull()):
raise TypeError("Provided target full of nulls.")just to shorten/simplify slightly?
evalml/utils/woodwork_utils.py
Outdated
| else: | ||
| ww_data = data.copy() | ||
| # Revert the inference of all nulls to the unknown type and change it back to double. | ||
| all_null_cols = ww_data.columns[ww_data.isnull().all(0)] |
There was a problem hiding this comment.
I handled that specific case. Feel free to check it out. I was considering maybe adding a similar test for the target imputer with the y series being pre-inited.
15f5c1e to
6da3e03
Compare
…ll columns as null columns are now inferred to Unknown type.
… for text featurizers to realize they have empty columns and adopt that behavior.
… to accomodate the new check for Unknown in get_pp_components. Made Email get treated properly in testing as WW should infer it properly now. Made infer_feature_types replace all pd.NA with np.nan for series as well as dataframes.
…n infer_feature_types.
…ction in infer_feature_types.
6da3e03 to
f757ba4
Compare
freddyaboulton
left a comment
There was a problem hiding this comment.
Thank you for your work on this @chukarsten !!
| "abalone_2@abalone.com", | ||
| "$titanic_data%&@hotmail.com", | ||
| "foo*EMAIL@email.org", | ||
| "titanic_data@hotmail.com", |
There was a problem hiding this comment.
Agreed. I think alteryx/woodwork#1080 will help cover against users passing impossible email values by manually specifying the email type.
bchen1116
left a comment
There was a problem hiding this comment.
Nice! I like the new test! left one comment on a typo, but LGTM
| ), | ||
| ) | ||
| def test_infer_feature_types_NA_to_nan(null_col, already_inited): | ||
| """A short test to make sure that columnds with all null values |

Fixes #2543