Address NaN values in Text Featurizer #2532

Merged: bchen1116 merged 12 commits into main from bc_1587_nlp_nan on Jul 26, 2021

bchen1116
Contributor

fix #1587

bchen1116 self-assigned this on Jul 20, 2021
codecov bot commented Jul 20, 2021

Codecov Report

Merging #2532 (1c4effc) into main (9c8acf7) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2532     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        285     285             
  Lines      26168   26203     +35     
=======================================
+ Hits       26132   26167     +35     
  Misses        36      36             
| Impacted Files | Coverage Δ |
|---|---|
| ...ents/transformers/preprocessing/text_featurizer.py | 100.0% <100.0%> (ø) |
| evalml/pipelines/utils.py | 99.1% <100.0%> (ø) |
| ...alml/tests/component_tests/test_text_featurizer.py | 100.0% <100.0%> (ø) |
| ...valml/tests/pipeline_tests/test_component_graph.py | 100.0% <100.0%> (ø) |
| evalml/tests/pipeline_tests/test_pipeline_utils.py | 100.0% <100.0%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

X_lsa = self._lsa.transform(X_ww.ww[self._text_columns])
X_ww_altered = infer_feature_types(
    X_ww.ww[self._text_columns].fillna(""),
    {s: "NaturalLanguage" for s in self._text_columns},
Contributor Author

We use this to keep the NaturalLanguage type after filling the NaNs with empty strings.

for col in X_nlp_primitives:
    X_ww.ww[col] = X_nlp_primitives[col]
for col in X_lsa:
    X_ww.ww[col] = X_lsa[col]

if X.isna().any().any():
Contributor Author

We only convert empty values to NaNs if the original data contained NaNs.
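The fill-then-restore idea from the two comments above can be sketched in plain pandas. This is only an illustration under stated assumptions, not the evalml implementation: `featurize_text` and the `LENGTH(...)` feature are hypothetical stand-ins for the real featurizer, and restoring NaNs through a per-cell mask is one possible way to do it.

```python
import numpy as np
import pandas as pd


def featurize_text(X: pd.DataFrame, text_columns: list) -> pd.DataFrame:
    """Hypothetical featurizer: emits one character-count feature per text column."""
    # Record which cells were missing before they are filled.
    nan_mask = X[text_columns].isna()
    # Fill missing text with empty strings so string operations do not fail.
    filled = X[text_columns].fillna("")

    features = pd.DataFrame(index=X.index)
    for col in text_columns:
        features[f"LENGTH({col})"] = filled[col].str.len().astype("float64")

    # Only put NaNs back if the original data contained any.
    if nan_mask.any().any():
        for col in text_columns:
            features.loc[nan_mask[col], f"LENGTH({col})"] = np.nan
    return features


X = pd.DataFrame({"review": ["great product", None, "bad"]})
out = featurize_text(X, ["review"])
print(out)
```

Rows that had no text end up as NaN in the feature column rather than as a length of zero, so a later step can still tell "missing" apart from "empty".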

Contributor

@chukarsten chukarsten left a comment

If you take the simplification Freddy requested, I feel pretty good about this! Nice work.

@@ -362,10 +362,10 @@ def test_make_pipeline_text_columns(input_type, problem_type):
     else:
         estimator_components = [OneHotEncoder, estimator_class]
     if estimator_class.model_family == ModelFamily.ARIMA:
-        expected_components = [Imputer, TextFeaturizer] + estimator_components
+        expected_components = [TextFeaturizer, Imputer] + estimator_components
Contributor

Why does this order change?

Contributor Author

@chukarsten Since the TextFeaturizer now passes NaN values through, the Imputer needs to come after it to impute those values.
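A rough sketch of why the order matters, assuming a featurizer that lets missing values through. The helpers `text_length_features` and `mean_impute` are hypothetical stand-ins written for this illustration; they are not evalml's `TextFeaturizer` or `Imputer`.

```python
import pandas as pd


def text_length_features(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical featurizer: missing text yields a NaN feature value."""
    lengths = {f"LENGTH({col})": X[col].str.len() for col in X.columns}
    return pd.DataFrame(lengths, index=X.index).astype("float64")


def mean_impute(features: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical imputer: replaces NaNs with the column mean."""
    return features.fillna(features.mean())


X = pd.DataFrame({"review": ["good", None, "terrible"]})

# Featurize first, then impute: the NaN produced for the missing review
# is filled by the imputer before an estimator would see the data.
steps = [text_length_features, mean_impute]
result = X
for step in steps:
    result = step(result)
print(result)
```

If the imputer ran first, it would only see the raw text column, and the NaNs produced afterwards by the featurizer would reach the estimator unimputed.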

Contributor

@freddyaboulton freddyaboulton left a comment

@bchen1116 This looks good to me! Thank you for making the changes!

bchen1116 merged commit 23f5d9b into main on Jul 26, 2021
chukarsten mentioned this pull request on Aug 3, 2021
freddyaboulton deleted the bc_1587_nlp_nan branch on May 13, 2022, 15:02
Successfully merging this pull request may close these issues.

Impute NaNs for natural language features
3 participants