Updates to support Woodwork 0.5.1 #2610

Merged
merged 22 commits into from
Aug 12, 2021

Conversation

@chukarsten chukarsten commented Aug 9, 2021

Fixes #2543

codecov bot commented Aug 9, 2021

Codecov Report

Merging #2610 (4a4af75) into main (4eee441) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2610     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        297     297             
  Lines      27033   27071     +38     
=======================================
+ Hits       26989   27027     +38     
  Misses        44      44             
Impacted Files Coverage Δ
evalml/data_checks/invalid_targets_data_check.py 100.0% <ø> (ø)
...ta_checks_tests/test_invalid_targets_data_check.py 100.0% <ø> (ø)
...components/transformers/imputers/target_imputer.py 100.0% <100.0%> (ø)
evalml/pipelines/utils.py 99.2% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)
...valml/tests/component_tests/test_simple_imputer.py 100.0% <100.0%> (ø)
...valml/tests/component_tests/test_target_imputer.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 100.0% <100.0%> (ø)
evalml/tests/utils_tests/test_woodwork_utils.py 100.0% <100.0%> (ø)
evalml/utils/woodwork_utils.py 100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@chukarsten chukarsten changed the title from "Ww 051 updates" to "Updates to support Woodwork 0.5.1" Aug 9, 2021

@thehomebrewnerd thehomebrewnerd left a comment

Was a bit curious about the updates required for WW so took a quick look at this. Noticed a couple things that I thought I'd mention. Feel free to ignore or act upon as you see fit.

@chukarsten chukarsten marked this pull request as ready for review August 10, 2021 18:35
@@ -76,7 +76,10 @@ def fit(self, X, y):
         """
         if y is None:
             return self
-        y = infer_feature_types(y).to_frame()
+        y = infer_feature_types(y)
Contributor Author

Moved the exception for the target imputer to fit.

Contributor

nit, but couldn't you do

y = infer_feature_types(y).to_frame()
if all(y.isnull()):
    raise TypeError("Provided target full of nulls.")

just to shorten/simplify slightly?
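
One caveat with the shortened form, assuming plain pandas semantics (the final target imputer code is not shown in this thread): the built-in `all()` iterates a DataFrame's column labels rather than its values, so calling it after `.to_frame()` tests the column name instead of the data. A minimal sketch with made-up data, where `is_all_null` is a hypothetical helper and not part of EvalML:

```python
import numpy as np
import pandas as pd

# Hypothetical all-null target; the name "target" is only for illustration.
y = pd.Series([np.nan, np.nan, np.nan], name="target")

# On a Series, built-in all() iterates the boolean values.
series_check = all(y.isnull())

# On a DataFrame, built-in all() iterates the column labels ("target"),
# so the result reflects the label's truthiness, not the data.
not_null_frame = pd.Series([1.0, 2.0], name="target").to_frame()
frame_check = all(not_null_frame.isnull())


def is_all_null(data):
    """Return True when every value in a Series or DataFrame is null."""
    return bool(np.asarray(pd.isnull(data)).all())


print(series_check)                 # the Series really is all null
print(frame_check)                  # truthy label, even though no value is null
print(is_all_null(not_null_frame))  # value-based check on the same frame
```

Reducing with pandas or NumPy, as in `is_all_null`, gives the same answer whether the check runs before or after `.to_frame()`.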

@@ -973,7 +973,7 @@ def fit(self, X, y):
         return self

     def predict(self, X):
-        series = pd.Series()
+        series = pd.Series(dtype="string")
Contributor Author

I think this change was to accommodate the way empty series are now inferred. Woodwork complains if you don't do this.
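
For context, a small pandas-only sketch of what the explicit `dtype` buys. It assumes nothing about Woodwork's inference beyond what the comment says; the point is only that passing `dtype` fixes the empty series' type up front instead of leaving it to pandas defaults, which have differed across pandas versions.

```python
import pandas as pd

# An explicit dtype gives the empty series a well-defined type.
empty_strings = pd.Series(dtype="string")

print(len(empty_strings))                               # 0
print(isinstance(empty_strings.dtype, pd.StringDtype))  # True
```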

@@ -72,15 +72,17 @@ def test_some_missing_col_names(text_df, caplog):
}


def test_lsa_empty_text_column():
X = pd.DataFrame({"col_1": []})
@pytest.mark.parametrize(
Contributor Author

Tried to expand the coverage here to cover other commonly "empty" columns. Also, the TextFeaturizer and LSA component have code that seems to expect tolerance of an empty column.

e.g.

    def fit(self, X, y=None):
        X = infer_feature_types(X)
        self._text_columns = self._get_text_columns(X)

        if len(self._text_columns) == 0:
            return self

I thought it was strange that we would pass through an empty column and expect sklearn to return the original ValueError, but also have code here that seemingly accounts for what LSA (and TextFeaturizer) should do upon receiving an empty column. I wonder if, perhaps, the original intent was to account for two distinct cases: 1.) an empty column whose type is known to be string or natural language and 2.) an empty column whose type is unknown. I'd be happy to hear additional input here.

Contributor

Yeah I think the purpose of X = infer_feature_types(X, {"col_1": "NaturalLanguage"}) was to verify that even an empty column whose type is NaturalLanguage will be identified by self._get_text_columns(X) and will not return self immediately. And since transform looks for that as well

        if len(self._text_columns) == 0:
            return X_ww

sklearn would be called to transform the features and raise an error. If X = infer_feature_types(X, {"col_1": "NaturalLanguage"}) is removed, no text features are recognized and self is returned immediately.

Can't speak to the original reasoning but it looks like those are the cases being presented.
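
The two cases discussed in this thread can be sketched with a toy class. Everything here is hypothetical: `ToyTextComponent` and its `declared_types` argument are stand-ins invented for illustration, not EvalML's LSA or TextFeaturizer, and the `ValueError` merely imitates the error an underlying vectorizer might raise on empty input.

```python
import pandas as pd


class ToyTextComponent:
    """Hypothetical stand-in for a text component; not EvalML's implementation."""

    def __init__(self, declared_types=None):
        # Maps column name -> type label, mimicking an explicit override
        # such as {"col_1": "NaturalLanguage"}.
        self.declared_types = declared_types or {}
        self._text_columns = []

    def _get_text_columns(self, X):
        return [
            col for col in X.columns
            if self.declared_types.get(col) == "NaturalLanguage"
        ]

    def fit(self, X, y=None):
        self._text_columns = self._get_text_columns(X)
        if len(self._text_columns) == 0:
            # No column is known to be text, so fit is a no-op.
            return self
        if len(X) == 0:
            # A typed text column with no rows: fitting is attempted and fails.
            raise ValueError("text column has no documents to fit on")
        return self


X = pd.DataFrame({"col_1": []})

# Untyped empty column: nothing is recognized as text, fit returns early.
untyped = ToyTextComponent().fit(X)
print(untyped._text_columns)

# Typed empty column: it is recognized, so fitting proceeds and raises.
try:
    ToyTextComponent({"col_1": "NaturalLanguage"}).fit(X)
    typed_raised = False
except ValueError as err:
    typed_raised = True
    print("raised:", err)
```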

@@ -64,8 +64,8 @@ def _get_test_data_from_configuration(
     "abalone_0@gmail.com",
     "AbaloneRings@yahoo.com",
     "abalone_2@abalone.com",
-    "$titanic_data%&@hotmail.com",
-    "foo*EMAIL@email.org",
+    "titanic_data@hotmail.com",
Contributor Author

I submitted this issue to Woodwork to cover these email addresses which slipped through the WW EmailAddress inference. @davesque since I saw you did the Email inference.

@chukarsten Yeah, never seen email addresses like that before :). I think it's safe to delete them from test data to accommodate the woodwork update.

Contributor

Agreed. I think alteryx/woodwork#1080 will help cover against users passing impossible email values by manually specifying the email type.
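
To illustrate why the two entries dropped from the test data are hard to infer as email addresses, here is a sketch using a deliberately simple pattern. The regex is an assumption made for illustration and is not claimed to be the one Woodwork uses.

```python
import re

# Hypothetical, simplified email pattern; real validators differ.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")

candidates = [
    "abalone_0@gmail.com",
    "AbaloneRings@yahoo.com",
    "abalone_2@abalone.com",
    "$titanic_data%&@hotmail.com",
    "foo*EMAIL@email.org",
    "titanic_data@hotmail.com",
]

# fullmatch requires the whole string to fit the pattern.
results = {
    address: EMAIL_PATTERN.fullmatch(address) is not None
    for address in candidates
}

for address, matched in results.items():
    print(f"{address!r}: {matched}")
```

Under this pattern the characters `$`, `%`, `&`, and `*` in the local part cause the match to fail, while the other four addresses match.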

if "email" in column_names and input_type == "ww"
else []
)
email_featurizer = [EmailFeaturizer] if "email" in column_names else []
Contributor Author

This is the change required for Email inference in WW.

@angela97lin angela97lin left a comment

I think this is good to go, thanks Karsten! Agreed w/ Freddy on the double DropColumn, I think we're set on AutoML side but it's maybe worth discussing outside this context why we have AutoML do this in the first place :P

@bchen1116 bchen1116 left a comment

Looking good! I left a few questions/nits, but agreed with @freddyaboulton that we should try not to append 2 DropColumn components to the pipeline, especially since it's likely we only set 1 of them.

return ww.init_series(data, logical_type=feature_types)
else:
ww_data = data.copy()
# Revert the inference of all nulls to the unknown type and change it back to double.
all_null_cols = ww_data.columns[ww_data.isnull().all(0)]
Contributor

I don't think this will work if ww is initialized before being passed into one of our components?

I think this might also cause a problem with partial dependence but I have not verified. Down to talk about it after stand-up!

Contributor Author

I handled that specific case. Feel free to check it out. I was considering maybe adding a similar test for the target imputer with the y series being pre-inited.
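
The all-null handling under discussion can be sketched with plain pandas. This is not the `infer_feature_types` implementation: the frame and column names are invented, and only the idea of selecting columns via `isnull().all(...)` is taken from the diff above.

```python
import pandas as pd

# Invented data: column "a" holds only nulls, column "b" holds real values.
data = pd.DataFrame({"a": [None, None, None], "b": [1, 2, 3]})

# Labels of the columns in which every row is null.
all_null_cols = data.columns[data.isnull().all(axis=0)]

# Cast only those columns to float64 so they carry a numeric (double) dtype.
fixed = data.astype({col: "float64" for col in all_null_cols})

print(list(all_null_cols))
print(fixed.dtypes["a"])
```

Whether this still applies when Woodwork was initialized on the data beforehand is exactly the question raised above; the sketch only covers the plain pandas case.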

…ll columns as null columns are now inferred to Unknown type.
… for text featurizers to realize they have empty columns and adopt that behavior.
… to accommodate the new check for Unknown in get_pp_components. Made Email get treated properly in testing as WW should infer it properly now. Made infer_feature_types replace all pd.NA with np.nan for series as well as dataframes.

@freddyaboulton freddyaboulton left a comment

Thank you for your work on this @chukarsten !!

@bchen1116 bchen1116 left a comment

Nice! I like the new test! Left one comment on a typo, but LGTM

),
)
def test_infer_feature_types_NA_to_nan(null_col, already_inited):
"""A short test to make sure that columnds with all null values
Contributor

typo: columns

Contributor Author

classic :)

Development

Successfully merging this pull request may close these issues.

Upgrade to Woodwork 0.5.1
7 participants