
Text featurizer: refactor and tweak featuretools usage #1090

Merged
merged 11 commits into from
Aug 21, 2020

Conversation

@dsherry dsherry commented Aug 21, 2020

Hey @eccabay , I ended up chatting with @rwedge later yesterday and he gave me some pointers on how we can use featuretools. I added those changes. I don't think much has changed functionally--I added text variable types as an input and tweaked how the featuretools index is created.

I also made some simplifications: I added a common base class for text transformers to avoid duplicating code, refactored the helpers to share code for creating the featuretools entity and to avoid mixing validation and side effects, and added one more LSA test.

I don't think this will change the performance results you've been getting, but hopefully it'll make it a bit easier to track that down.

@dsherry dsherry changed the title Text featurizer: update featuretools Text featurizer: refactor and tweak featuretools usage Aug 21, 2020
text_columns = self._get_text_columns(X)
corpus = X[text_columns].values.flatten()
# we assume non-str values will have been filtered out prior to calling LSA.fit. this is a safeguard.
corpus = corpus.astype(str)
dsherry (author) commented:
This is a bugfix--if you pass non-str values into LSA, it errors out in the call to pipeline fit below unless you convert all the columns to string type first. This means numbers, None, and nan will all be represented as strings.

A future improvement could be to replace None/nan with an empty string, or to simply remove them, to avoid polluting the TF-IDF corpus.
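The tradeoff between the current safeguard and that future improvement can be sketched in plain pandas (the helper names here are hypothetical, not evalml's actual API): a blind `astype(str)` turns missing values into the literal tokens `'None'`/`'nan'`, while filling them with empty strings first keeps them out of the TF-IDF vocabulary.

```python
import numpy as np
import pandas as pd

def build_corpus(X, text_columns):
    # Current safeguard: flatten the text columns and cast everything to
    # str. Missing values become the literal strings 'None'/'nan'.
    corpus = X[text_columns].values.flatten()
    return corpus.astype(str)

def build_corpus_clean(X, text_columns):
    # Possible future improvement: replace None/nan with empty strings
    # first, so they don't pollute the TF-IDF corpus as fake tokens.
    corpus = X[text_columns].fillna("").values.flatten()
    return corpus.astype(str)

X = pd.DataFrame({"text": ["hello world", None, np.nan]})
corpus = build_corpus(X, ["text"])
corpus_clean = build_corpus_clean(X, ["text"])
```

Here `corpus` contains string forms of the missing values, while `corpus_clean` contains empty strings in their place.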

dsherry (author) commented:

But theoretically, Nones/nans were filtered out by the imputer earlier in the evalml pipeline.

X[text_col] = X[text_col].apply(normalize)
for col_name in X.columns:
# we assume non-str values will have been filtered out prior to calling TextFeaturizer. casting to str is a safeguard.
col = X[col_name].astype(str)
dsherry (author) commented:

Same as above--if there are non-str values in the feature, we should convert to str first to avoid a stack trace when translate is called in normalize.

return X
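The failure mode being guarded against can be shown with a toy normalizer (this `normalize` is an illustrative stand-in, not the featurizer's actual implementation): `str.translate` only exists on str, so an int or None in the column would raise AttributeError without the `astype(str)` cast.

```python
import string
import pandas as pd

# Punctuation-stripping table; translate() is a str-only method.
_punct_table = str.maketrans("", "", string.punctuation)

def normalize(text):
    # Raises AttributeError if `text` is not a str (e.g. 42 or None).
    return text.translate(_punct_table).lower()

col = pd.Series(["Hello, World!", 42, None])
# Casting the whole column to str first makes normalize safe to apply.
safe = col.astype(str).apply(normalize)
```

Without the cast, `col.apply(normalize)` would blow up on the second element; with it, every value normalizes cleanly (at the cost of stringified missing values, as discussed above).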

def _verify_col_names(self, col_names):
missing_cols = [col for col in self._text_col_names if col not in col_names]
def _make_entity_set(self, X, text_columns):
dsherry (author) commented:

Now fit and transform use shared code to create the entity set

codecov bot commented Aug 21, 2020

Codecov Report

Merging #1090 into main will increase coverage by 6.65%.
The diff coverage is 97.40%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1090      +/-   ##
==========================================
+ Coverage   93.24%   99.89%   +6.65%     
==========================================
  Files         191      192       +1     
  Lines       10852    10853       +1     
==========================================
+ Hits        10119    10842     +723     
+ Misses        733       11     -722     
Impacted Files Coverage Δ
...ediction_explanations_tests/test_user_interface.py 100.00% <ø> (ø)
...ents/transformers/preprocessing/text_featurizer.py 96.42% <93.33%> (-3.58%) ⬇️
.../components/transformers/preprocessing/__init__.py 100.00% <100.00%> (ø)
...lines/components/transformers/preprocessing/lsa.py 100.00% <100.00%> (ø)
...nts/transformers/preprocessing/text_transformer.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_lsa.py 100.00% <100.00%> (ø)
...alml/tests/component_tests/test_text_featurizer.py 100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py 99.60% <0.00%> (+0.19%) ⬆️
evalml/automl/automl_search.py 99.55% <0.00%> (+0.44%) ⬆️
... and 22 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5302952...820bca4.


es = self._ft.EntitySet()
es.entity_from_dataframe(entity_id='X', dataframe=X_text, index='index', make_index=True,
variable_types=all_text_variable_types)
dsherry (author) commented:

Roy had two suggestions: 1) use make_index=True instead of creating an index and adding it to the DF, and 2) pass in the desired variable types. I don't think either changes the behavior here.

One detail is that both before and after this PR, the index col gets created and added to the dataframe in the entity set. I think that's fine; we should just make sure the index col doesn't appear in the output of transform. I'm pretty sure calculate_feature_matrix filters that out though.
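The index concern can be sketched without featuretools (these helpers are hypothetical, for illustration only): `make_index=True` effectively prepends an integer 'index' column to the dataframe inside the entity set, and the safeguard is making sure that column never reaches the transform output.

```python
import pandas as pd

def make_indexed_frame(X_text):
    # Roughly what make_index=True does, for illustration: prepend an
    # integer 'index' column to the text dataframe.
    df = X_text.copy()
    df.insert(0, "index", range(len(df)))
    return df

def drop_index_from_output(feature_matrix):
    # Safeguard: ensure the generated index column doesn't leak into the
    # output of transform. (calculate_feature_matrix is believed to
    # filter it out already.)
    return feature_matrix.drop(columns=["index"], errors="ignore")

X_text = pd.DataFrame({"col_1": ["hello", "world"]})
indexed = make_indexed_frame(X_text)
cleaned = drop_index_from_output(indexed)
```

`indexed` carries the extra 'index' column; `cleaned` contains only the original text columns.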

self._features = self._ft.dfs(entityset=es,
                              target_entity='X',
                              trans_primitives=self._trans,
dsherry (author) commented:

Defining these in __init__ means we now have a private API where you can init the component, then override the primitives used, then call fit/transform. May be useful for quick testing internally.
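The pattern looks roughly like this (class and primitive names are made up for illustration, not evalml's real API): because the primitive list lives on self from __init__, it can be swapped out between init and fit.

```python
class TextFeaturizerSketch:
    def __init__(self):
        # Default primitives, set once at init time.
        self._trans = ["diversity_score", "mean_characters"]

    def fit(self, X=None):
        # A real implementation would pass trans_primitives=self._trans
        # to featuretools.dfs here; we just record what would be used.
        self.fitted_primitives_ = list(self._trans)
        return self

featurizer = TextFeaturizerSketch()
featurizer._trans = ["polarity_score"]  # override before fit, for quick testing
featurizer.fit()
```

After fitting, the component uses the overridden primitive list rather than the defaults.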

es = self._make_entity_set(X, text_columns)
X_nlp_primitives = self._ft.calculate_feature_matrix(features=self._features, entityset=es)
if X_nlp_primitives.isnull().any().any():
X_nlp_primitives.fillna(0, inplace=True)
dsherry (author) commented:

I think I lifted these two lines from your branch for #908.

raise AttributeError("None of the provided text column names match the columns in the given DataFrame")
if len(columns) < len(self._all_text_columns):
logger.warn("Columns {} were not found in the given DataFrame, ignoring".format(missing_columns))
return columns
dsherry (author) commented:

Note this method has no side effects. It computes and returns a value; it may also error/warn, but it doesn't modify any of the state attached to self. I think this makes it easier for a new reader to understand what _get_text_columns does while skimming through the code.

The extra mile would be to make _get_text_columns a staticmethod... but I was too lazy 🤷‍♂️ 😂
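A hedged sketch of that side-effect-free shape (names are illustrative, not the actual evalml class): the method reads configuration from self and returns a value, and may warn or raise, but never mutates state.

```python
import warnings
import pandas as pd

class TextTransformerSketch:
    def __init__(self, all_text_columns):
        self._all_text_columns = all_text_columns

    def _get_text_columns(self, X):
        # Pure lookup: compute which configured text columns exist in X.
        columns = [c for c in self._all_text_columns if c in X.columns]
        if not columns:
            raise AttributeError("None of the provided text column names "
                                 "match the columns in the given DataFrame")
        if len(columns) < len(self._all_text_columns):
            missing = set(self._all_text_columns) - set(columns)
            warnings.warn("Columns {} were not found in the given "
                          "DataFrame, ignoring".format(missing))
        return columns

transformer = TextTransformerSketch(["title", "body"])
X = pd.DataFrame({"title": ["a"], "other": ["b"]})
```

Calling `transformer._get_text_columns(X)` returns `["title"]` and warns about the missing `body` column, leaving `transformer` itself untouched.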

@dsherry dsherry marked this pull request as ready for review August 21, 2020 12:58
@eccabay eccabay (Contributor) left a comment:

LGTM, thanks for doing this! I'll run some more performance tests once this is merged.

@dsherry dsherry merged commit cbe513f into main Aug 21, 2020
@dsherry dsherry added this to the August 2020 milestone Aug 21, 2020

eccabay commented Aug 24, 2020

@dsherry performance testing results are here
