Integrate TextFeaturizer to automl #1062
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1062 +/- ##
==========================================
+ Coverage 99.95% 99.95% +0.01%
==========================================
Files 213 213
Lines 13835 13857 +22
==========================================
+ Hits 13828 13850 +22
Misses 7 7
Continue to review full report at Codecov.
@@ -389,9 +386,14 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_plot=True):
        X = pd.DataFrame(X)
        X = ww.DataTable(X)

        text_column_vals = X.select('natural_language')
This produces a warning if there are no text columns in X, open to any suggestions on how to suppress that!
Good catch! I'll file that in woodwork. I don't think there should be a warning in this case; we can discuss on that ticket.
For this PR, you can add:
warnings.filterwarnings('ignore', 'The following selectors were not present in your DataTable')
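A minimal sketch of how that suppression could be applied in search, assuming X is already a ww.DataTable; wrapping it in catch_warnings is my suggestion so the filter only applies around the select call:

import warnings

with warnings.catch_warnings():
    warnings.filterwarnings('ignore', 'The following selectors were not present in your DataTable')
    text_column_vals = X.select('natural_language')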
I think this would work:
from woodwork.logical_types import NaturalLanguage
[col_name for col_name, ltype in X.logical_types.items() if ltype == NaturalLanguage]
@gsheni what do you think?
Filed alteryx/woodwork#322 for this warning
Yes, that should work. We may fix this due to the issue Dylan raised.
@freddyaboulton oops I had written back but didn't submit my comments until after you commented 😆
I think we can just suppress the warning, yeah?
@eccabay Looks good to me!
@@ -438,6 +440,7 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_plot=True):
            max_iterations=self.max_iterations,
            allowed_pipelines=self.allowed_pipelines,
            tuner_class=self.tuner_class,
            text_columns=text_columns,
Do we need to make this change? Can't the logic for identifying the natural language columns be delegated to the fit method of the text transformers (if none are passed into init)?
@freddyaboulton you're right, we could delegate this to the fit method of the text transformers. However, I think this is better. This PR as it stands will use the woodwork datatable to detect and select text features. If users don't like the default selections, they can override and pass a new datatable into search. But if we had the text featurizer do the detection, users wouldn't be able to override as easily. This is one of the big value adds of the datatables for us: it provides users with a clear way to control how automl search treats each feature, and to change the behavior!
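For example, a hypothetical override could look like the following; the column names, data, and disabled data checks here are illustrative assumptions, not code from this PR:

import pandas as pd
import woodwork as ww
from woodwork.logical_types import NaturalLanguage
from evalml.automl import AutoMLSearch

# Treat the "notes" column as text even if woodwork's type inference
# would not classify it as natural language on its own.
X = pd.DataFrame({
    "notes": ["great product", "terrible support", "would buy again",
              "never again", "love it", "meh"],
    "amount": [10, 20, 30, 40, 50, 60],
})
y = [1, 0, 1, 0, 1, 0]
X_dt = ww.DataTable(X, logical_types={"notes": NaturalLanguage})

automl = AutoMLSearch(problem_type="binary")
automl.search(X_dt, y, data_checks="disabled")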
Long-term, we'd like all pipelines, components and utils to be able to accept woodwork datatables (#1288). At that point, we can delete this text_columns plumbing in favor of the datatables.
    })
    y = [0, 1, 1, 0, 1, 0]
    automl = AutoMLSearch(problem_type='binary')
    automl.search(X, y, data_checks='disabled')
Can you please add a comment explaining why the data checks are disabled? Is it because there isn't enough data for 3-fold cv?
@@ -30,6 +31,16 @@ def test_iterative_algorithm_allowed_pipelines(logistic_regression_binary_pipeli
    assert algo.allowed_pipelines == allowed_pipelines


class MockEstimator(Estimator):
We have a fixture for a mock classifier called dummy_classifier_estimator_class. Would that work in this test?
@eccabay this is 🔥! Really great test coverage.
All that's blocking IMO is:
- I left a suggestion for how to suppress the woodwork warning in automl search text feature selection.
- I left one question about test_iterative_algorithm_instantiates_text.
- I left a comment about adding feature_types to the breaking changes.
- Perf test results.
woodwork>=0.0.3
featuretools>=0.20.0
nlp-primitives>=1.1.0
Yep, nice!!
For others reading, featuretools appears to have pretty good backwards compatibility for dfs, so if we get a feature request to lower this, we can certainly do that. I bet the same is true with nlp-primitives, but I think it's the right call to fix at 1.1.0, because anything at/after that version will have our new treatment for including the nltk corpora during installation, rather than downloading them at first run.
docs/source/release_notes.rst
@@ -14,6 +14,7 @@ Release Notes
    * Updated ``AutoMLSearch`` to support ``Woodwork`` data structures :pr:`1299`
    * Added cv_folds to ``ClassImbalanceDataCheck`` and added this check to ``DefaultDataChecks`` :pr:`1333`
    * Make ``max_batches`` argument to ``AutoMLSearch.search`` public :pr:`1320`
    * Included TextFeaturizer in options for automl search :pr:`1062`
Yep! To make this easier for users to understand, could you say "Added text support to automl search"?
@@ -111,6 +113,10 @@ def _transform_parameters(self, pipeline_class, proposed_parameters):
        component_parameters = proposed_parameters.get(component_class.name, {})
        init_params = inspect.signature(component_class.__init__).parameters

        # Add the text columns parameter if the component is a TextFeaturizer
        if component_class.name == "Text Featurization Component":
            component_parameters['text_columns'] = self._text_columns
Looks good.
I'm hoping that we can remove this pathway in #1229, when we add support for woodwork to pipelines and components, because at that point the text featurizer can access the woodwork datatable directly!
@@ -353,17 +353,14 @@ def _set_data_split(self, X):

        self.data_split = self.data_split or default_data_split

    def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_plot=True):
Great call! This is totally dead. I genuinely don't know why we didn't delete it sooner 😆
Because it's changing the method signature of search, just for completeness let's add an entry to the breaking changes, like: Removed unused argument feature_types from AutoMLSearch.search.
    pipeline = algo.next_batch()[0]
    expected_params = {'text_columns': ['text_col_1', 'text_col_2']}
    assert pipeline.parameters['Text Featurization Component'] == expected_params
    assert isinstance(pipeline.component_graph[0], TextFeaturizer)
Hmm, wouldn't pipeline.component_graph[0] be the imputer in this case, and you want pipeline.component_graph[1]? What am I missing?
The pipeline in this test isn't dynamically generated; it's defined a few lines above as only having the TextFeaturizer and an estimator.
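For context, a sketch of the kind of test pipeline being described; the class and component names here are assumptions based on this thread, not the exact test code:

from evalml.pipelines import BinaryClassificationPipeline

class MockTextPipeline(BinaryClassificationPipeline):
    # Only a text featurizer and an estimator, so component_graph[0]
    # is the TextFeaturizer rather than an imputer.
    component_graph = ['Text Featurization Component', 'Random Forest Classifier']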
Ah right, thanks!
Performance testing doc is here.
@eccabay thanks, I just reviewed the perf tests. Looks great!
@eccabay to fix the
Closes #908 and fixes #978 by integrating the TextFeaturizer component into automl and adding nlp_primitives to core-requirements.txt. The component lives between the Imputer and the OneHotEncoder in dynamically generated pipeline graphs. Performance tests are not at a point where graphs can be generated, but I will post those here once available.
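As a rough sketch of how that placement could be checked on a generated pipeline (the data, the explicit NaturalLanguage override, and the expected ordering are illustrative assumptions, not output from the perf tests):

import pandas as pd
import woodwork as ww
from woodwork.logical_types import NaturalLanguage
from evalml.automl import AutoMLSearch

X = ww.DataTable(pd.DataFrame({
    "review": ["loved it", "hated it", "it was fine", "great", "awful", "ok"],
    "category": ["a", "b", "a", "b", "a", "b"],
}), logical_types={"review": NaturalLanguage})
y = [1, 0, 1, 1, 0, 1]

automl = AutoMLSearch(problem_type="binary")
automl.search(X, y, data_checks="disabled")
best = automl.best_pipeline
# Expect an order like: Imputer -> Text Featurization Component ->
# One Hot Encoder -> estimator
print([component.name for component in best.component_graph])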