Text featurizer: refactor and tweak featuretools usage #1090
Conversation
…riable type validation back in.
text_columns = self._get_text_columns(X)
corpus = X[text_columns].values.flatten()
# we assume non-str values will have been filtered out prior to calling LSA.fit. this is a safeguard.
corpus = corpus.astype(str)
This is a bugfix--if you pass non-str values into LSA, it errors out in the call to pipeline fit below, unless you convert all the columns to string type first. This means numbers, None, and nan will all get represented as strings.
A future improvement could be to replace None/nan with an empty string, or to simply remove them, to avoid polluting the TF-IDF corpus.
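As a sketch of the behavior in question (hypothetical column data, not taken from the PR): pandas stringifies None and NaN under astype(str), so filling missing values with an empty string first keeps them out of the corpus:

```python
import numpy as np
import pandas as pd

# hypothetical text column containing missing values
col = pd.Series(["hello world", None, np.nan], dtype=object)

# astype(str) turns missing values into the literal strings 'None' and 'nan'
polluted = col.astype(str).tolist()

# filling missing values first keeps them out of the TF-IDF corpus
clean = col.fillna("").astype(str).tolist()
```

Here `polluted` contains the strings 'None' and 'nan', while `clean` contains empty strings in their place.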
But theoretically, Nones/nans were filtered out by the imputer earlier in the evalml pipeline.
X[text_col] = X[text_col].apply(normalize)
for col_name in X.columns:
    # we assume non-str values will have been filtered out prior to calling TextFeaturizer. casting to str is a safeguard.
    col = X[col_name].astype(str)
Same as above--if there are non-str values in the feature, we should convert to str first to avoid a stack trace when translate is called in normalize.
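To illustrate (a minimal sketch; `normalize` here is a hypothetical stand-in for the component's helper, assumed to strip punctuation via str.translate): non-str values have no translate method and would raise AttributeError, so casting to str first avoids the stack trace:

```python
import string

def normalize(text):
    # hypothetical stand-in for the featurizer's normalize helper:
    # strip punctuation via str.translate, then lowercase
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

# non-str values (numbers, None) have no .translate; cast to str first
values = ["Hello, World!", 42, None]
normalized = [normalize(str(v)) for v in values]
```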
    return X

-    def _verify_col_names(self, col_names):
-        missing_cols = [col for col in self._text_col_names if col not in col_names]
+    def _make_entity_set(self, X, text_columns):
Now fit and transform use shared code to create the entity set.
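A minimal sketch of that pattern (hypothetical class, not the evalml code): both fit and transform route entity-set construction through one helper so the two paths can never drift apart:

```python
class SharedHelperTransformer:
    """Hypothetical sketch of fit and transform sharing one entity-set helper."""

    def _make_entity_set(self, X, text_columns):
        # the real component builds a featuretools EntitySet here;
        # this sketch just records what would go into one
        return {"dataframe": X, "text_columns": text_columns}

    def fit(self, X, text_columns):
        self._es = self._make_entity_set(X, text_columns)
        return self

    def transform(self, X, text_columns):
        # same construction path as fit
        return self._make_entity_set(X, text_columns)

es = SharedHelperTransformer().fit(["some text"], ["col_a"])._es
```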
Codecov Report
@@            Coverage Diff            @@
##             main    #1090     +/-  ##
=========================================
+ Coverage   93.24%   99.89%   +6.65%
=========================================
  Files         191      192       +1
  Lines       10852    10853       +1
=========================================
+ Hits        10119    10842     +723
+ Misses        733       11     -722

Continue to review full report at Codecov.
es = self._ft.EntitySet()
es.entity_from_dataframe(entity_id='X', dataframe=X_text, index='index', make_index=True,
                         variable_types=all_text_variable_types)
Roy had two suggestions: 1) use make_index=True instead of creating an index and adding it to the DF, and 2) pass in the desired variable types. I don't think either changes the behavior here.
One detail is that both before and after this PR, the index col gets created and added to the dataframe in the entity set. I think that's fine; we should just make sure the index col doesn't appear in the output of transform. I'm pretty sure calculate_feature_matrix filters that out though.
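As a defensive sketch (pandas only; the column names are hypothetical, not taken from the PR), dropping the auto-created index column would guarantee it never leaks into transform's output even if calculate_feature_matrix were to include it:

```python
import pandas as pd

# hypothetical feature matrix where the entity-set index column leaked through
feature_matrix = pd.DataFrame({
    "index": [0, 1, 2],                       # auto-created by make_index=True
    "DIVERSITY_SCORE(col)": [0.5, 0.7, 0.2],  # hypothetical primitive output
})

# drop the index column defensively; errors='ignore' makes this a no-op when absent
X_out = feature_matrix.drop(columns=["index"], errors="ignore")
```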
self._features = self._ft.dfs(entityset=es,
                              target_entity='X',
-                             trans_primitives=trans,
+                             trans_primitives=self._trans,
Defining these in __init__ means we now have a private API where you can init the component, then override the primitives used, then call fit/transform. May be useful for quick testing internally.
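A minimal sketch of that pattern (hypothetical class and primitive names, not the evalml implementation): storing the primitives on the instance in __init__ lets a test override them before fit:

```python
class TextFeaturizerSketch:
    """Hypothetical sketch of storing trans_primitives in __init__."""

    def __init__(self):
        # default primitives; private, but overridable before fit for quick testing
        self._trans = ["diversity_score", "mean_characters_per_word"]

    def fit(self, X=None):
        # the real component would pass self._trans to featuretools dfs here
        self.fitted_primitives_ = list(self._trans)
        return self

featurizer = TextFeaturizerSketch()
featurizer._trans = ["polarity_score"]  # private API: swap primitives before fitting
featurizer.fit()
```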
es = self._make_entity_set(X, text_columns)
X_nlp_primitives = self._ft.calculate_feature_matrix(features=self._features, entityset=es)
if X_nlp_primitives.isnull().any().any():
    X_nlp_primitives.fillna(0, inplace=True)
I think I lifted these two lines from your branch for #908
    raise AttributeError("None of the provided text column names match the columns in the given DataFrame")
if len(columns) < len(self._all_text_columns):
    logger.warn("Columns {} were not found in the given DataFrame, ignoring".format(missing_columns))
return columns
Note this method has no side effects. It computes and returns a value; it may also error/warn, but it doesn't modify any of the state attached to self. I think this makes it easier for a new reader to understand what _get_text_columns does while skimming through the code.
The extra mile would be to make _get_text_columns a staticmethod... but I was too lazy 🤷♂️ 😂
LGTM, thanks for doing this! I'll run some more performance tests once this is merged.
Hey @eccabay, I ended up chatting with @rwedge later yesterday and he gave me some pointers on how we can use featuretools. I added those changes. I don't think much has changed functionally--I added text variable types as an input and tweaked how the featuretools index is created.
I also made some simplifications: added a common base class for text transformers to avoid duplicating code, refactored the helpers to share code for creating the featuretools entity and to avoid mixing validation with side effects, and added one more LSA test.
I don't think this will change the performance results you've been getting, but hopefully it'll make it a bit easier to track that down.