
Support for NaN types for NLP in AutoMLSearch #2577

Merged
merged 7 commits into from Aug 3, 2021

Conversation

bchen1116 (Contributor):

Closes #2515

Added support to ensure missing natural-language values are handled in AutoMLSearch. We convert pd.NA to None to preserve the float64 dtype that we expect from the text featurizer.
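The dtype difference between the two missing-value markers can be sketched with plain pandas (illustrative only; the column values are made up, not from the PR):

```python
import pandas as pd

# Constructing a numeric column with pd.NA prevents float inference,
# so pandas falls back to object dtype...
obj_col = pd.Series([0.1, pd.NA, 0.3])

# ...while None is coerced to np.nan and the column stays float64,
# which is the dtype downstream consumers of the featurizer expect.
float_col = pd.Series([0.1, None, 0.3])

print(obj_col.dtype)    # object
print(float_col.dtype)  # float64
```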

@bchen1116 bchen1116 self-assigned this Aug 2, 2021
codecov bot commented Aug 2, 2021

Codecov Report

Merging #2577 (67571d7) into main (d886206) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2577     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        293     293             
  Lines      26881   26894     +13     
=======================================
+ Hits       26838   26851     +13     
  Misses        43      43             
Impacted Files Coverage Δ
...ents/transformers/preprocessing/text_featurizer.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 99.7% <100.0%> (+0.1%) ⬆️
...alml/tests/component_tests/test_text_featurizer.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -141,17 +140,17 @@ def transform(self, X, y=None):
{s: "NaturalLanguage" for s in self._text_columns},
)
X_lsa = self._lsa.transform(X_ww_altered)
X_nlp_primitives.set_index(X_ww.index, inplace=True)
bchen1116 (Contributor, Author):
Reset the index to be equal to the input X

Contributor:
Just curious, why is this necessary for this PR?

bchen1116 (Contributor, Author):

During AutoMLSearch, the data splits produce input indices that aren't consecutive. X_nlp_primitives completely resets its index (to 0, 1, 2, ...), which causes errors when we try to set the None values using the mask. I moved this line from below the for loop to above it so that the .loc call doesn't fail.
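The index mismatch can be reproduced with a small pandas sketch (the column name and index labels are hypothetical, not from the PR):

```python
import pandas as pd

# Simulated AutoMLSearch split: the input keeps non-consecutive labels.
X_ww = pd.DataFrame({"text_len": [5.0, 7.0, 9.0]}, index=[3, 7, 11])
nan_mask = pd.Series([False, True, False], index=[3, 7, 11])

# The primitive output comes back with a freshly reset index (0, 1, 2).
X_nlp_primitives = pd.DataFrame({"text_len": [5.0, 7.0, 9.0]})

# Without realigning, .loc with the boolean mask raises an alignment
# error; setting the input's index first lets the mask apply cleanly.
X_nlp_primitives.set_index(X_ww.index, inplace=True)
X_nlp_primitives.loc[nan_mask, "text_len"] = None

print(X_nlp_primitives.loc[7, "text_len"])  # nan
```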

Contributor:

We don't have to do it for the LSA features because they are computed by the LSA component, which preserves the index, since our components preserve indices!

X_nlp_primitives.set_index(X_ww.index, inplace=True)

X_lsa.loc[nan_mask[column], derived_features] = None
X_lsa = X_lsa.astype("float64")
bchen1116 (Contributor, Author):
Enforce that the dtype is float64, in case pandas decides to treat None like pd.NA or np.nan. The latter two cause the dtype to become object (and those columns to be treated as categorical).
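A minimal pandas sketch of that enforcement (hypothetical values; the real code applies it to the LSA feature columns):

```python
import pandas as pd

# A column that picked up object dtype because the missing value entered
# as a Python object rather than np.nan.
col = pd.Series([0.5, None, 1.5], dtype=object)
assert col.dtype == object

# Casting back coerces the missing entry to np.nan and restores the
# numeric dtype, so downstream type inference won't call it categorical.
col = col.astype("float64")
print(col.dtype)  # float64
```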

Contributor:

IMO we should use X_lsa.ww.init(logical_types={col: "Double" for col in X_lsa.columns}). The reason is that if we don't set a ww type explicitly, ww will still infer types for the features in X_lsa. Right now, that inference works in our favor because ww uses pd.dtypes.is_float_dtype but maybe that'll change in the future. I think the safer thing to do here is to set an explicit woodwork logical type and avoid the type inference.

@bchen1116 bchen1116 requested a review from a team August 2, 2021 18:57
freddyaboulton (Contributor) left a comment:

Thanks @bchen1116 ! I left a suggestion for test_search_text_with_nans but otherwise this looks fantastic!

X.ww.init(logical_types={"b": "NaturalLanguage"})
y = pd.Series([0] * 25 + [1] * 75)

automl = AutoMLSearch(X_train=X, y_train=y, problem_type="binary", max_iterations=2)
Contributor:

I think the follow-up bug we encountered was that the types were being set to categorical by woodwork, and only the LightGBMClassifier failed because of it, right?

I think it may be better to use the AutoMLSearchEnv and check in mock_fit that all of the woodwork types are Double/Integer. I see we also check that in test_text_featurizer.py now, but if we're going to add an AutoMLSearch test, I think it's worth running a full batch and checking what we're after with mocks. Correct me if I'm wrong, but this test wouldn't catch the LGBM problem, right?

bchen1116 (Contributor, Author):

Yeah, good point! I'll patch that

angela97lin (Contributor) left a comment:

This looks great! Just had a question about why we need to reset the index: is it related to this PR, or something you noticed along the way? Either way, LGTM 😁


@bchen1116 bchen1116 merged commit 8d6b7ec into main Aug 3, 2021
@chukarsten chukarsten mentioned this pull request Aug 3, 2021
@freddyaboulton freddyaboulton deleted the bc_2515_nlp branch May 13, 2022 15:34
Successfully merging this pull request may close these issues.

Support for missing data in Natural Language types