Skip to content

Integrate TextFeaturizer to automl #1062

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 38 commits into from
Oct 28, 2020
Merged

Integrate TextFeaturizer to automl #1062

merged 38 commits into from
Oct 28, 2020

Conversation

eccabay
Copy link
Contributor

@eccabay eccabay commented Aug 14, 2020

Closes #908 and fixes #978 by integrating the TextFeaturizer component into automl and adding nlp_primitives to core-requirements.txt. The component lives between the Imputer and the OneHotEncoder in dynamically generated pipeline graphs.

Performance tests are not at a point where graphs can be generated, but I will post those here once available.

@codecov
Copy link

codecov bot commented Aug 14, 2020

Codecov Report

Merging #1062 into main will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1062      +/-   ##
==========================================
+ Coverage   99.95%   99.95%   +0.01%     
==========================================
  Files         213      213              
  Lines       13835    13857      +22     
==========================================
+ Hits        13828    13850      +22     
  Misses          7        7              
Impacted Files Coverage Δ
evalml/__init__.py 100.00% <100.00%> (ø)
...lml/automl/automl_algorithm/iterative_algorithm.py 100.00% <100.00%> (ø)
evalml/automl/automl_search.py 99.62% <100.00%> (+0.01%) ⬆️
...ents/transformers/preprocessing/text_featurizer.py 100.00% <100.00%> (ø)
evalml/pipelines/utils.py 100.00% <100.00%> (ø)
evalml/tests/automl_tests/test_automl.py 100.00% <100.00%> (ø)
...lml/tests/automl_tests/test_iterative_algorithm.py 100.00% <100.00%> (ø)
...alml/tests/component_tests/test_text_featurizer.py 100.00% <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 100.00% <100.00%> (ø)
evalml/tests/utils_tests/test_cli_utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b40aae1...ba1be2f. Read the comment docs.

@eccabay eccabay marked this pull request as ready for review October 27, 2020 17:36
@@ -389,9 +386,14 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
X = pd.DataFrame(X)
X = ww.DataTable(X)

text_column_vals = X.select('natural_language')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This produces a warning if there are no text columns in X, open to any suggestions on how to suppress that!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I'll file that in woodwork. I don't think there should be a warning in this case; we can discuss on that ticket.

For this PR, you can add:

warnings.filterwarnings('ignore', 'The following selectors were not present in your DataTable')

Copy link
Contributor

@freddyaboulton freddyaboulton Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would this work?

from woodwork.logical_types import NaturalLanguage

 [col_name for col_name, ltype in X.logical_types.items() if ltype == NaturalLanguage]

@gsheni what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed alteryx/woodwork#322 for this warning

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that should work. We may fix this due to the issue Dylan raised.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton oops I had written back but didn't submit my comments until after you commented 😆

I think we can just suppress the warning, yeah?

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay Looks good to me!

@@ -438,6 +440,7 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
max_iterations=self.max_iterations,
allowed_pipelines=self.allowed_pipelines,
tuner_class=self.tuner_class,
text_columns=text_columns,
Copy link
Contributor

@freddyaboulton freddyaboulton Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to make this change? Can't the logic for identifying the natural language columns be delegated to the fit method of the text transformers (if none are passed into init)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton you're right, we could delegate this to the fit method of the text transformers. However, I think this is better. This PR as it stands will use the woodwork datatable to detect and select text features. If users don't like the default selections, they can override and pass a new datatable into search. But if we had the text featurizer do the detection, users wouldn't be able to override as easily. This is one of the big value adds of the datatables for us--it provides users with a clear way to control how automl search treats each feature, and to change the behavior!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long-term, we'd like all pipelines, components and utils to be able to accept woodwork datatables (#1288). At that point, we can delete this text_columns plumbing in favor of the datatables

})
y = [0, 1, 1, 0, 1, 0]
automl = AutoMLSearch(problem_type='binary')
automl.search(X, y, data_checks='disabled')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a comment explaining why the data checks are disabled? Is it because there isn't enough data for 3-fold cv?

@@ -30,6 +31,16 @@ def test_iterative_algorithm_allowed_pipelines(logistic_regression_binary_pipeli
assert algo.allowed_pipelines == allowed_pipelines


class MockEstimator(Estimator):
Copy link
Contributor

@freddyaboulton freddyaboulton Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a fixture for a mock classifier called dummy_classifier_estimator_class would that work in this test?

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay this is🔥 ! Really great test coverage.

All that's blocking IMO is:

  • I left a suggestion for how to suppress the woodwork warning in automl search text feature selection.
  • I left one question about test_iterative_algorithm_instantiates_text
  • Left comment about adding feature_types to breaking changes
  • Perf test results

woodwork>=0.0.3
featuretools>=0.20.0
nlp-primitives>=1.1.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, nice!!

For others reading, featuretools appears to have pretty good backwards compatibility for dfs, so if we get a feature request to lower this, we can certainly do that. I bet the same is true with nlp-primitives, but I think its the right call to fix at 1.1.0, because anything at/after that version will have our new treatment for including the nltk corpuses during installation, rather than downloading at first run.

@@ -14,6 +14,7 @@ Release Notes
* Updated ``AutoMLSearch`` to support ``Woodwork`` data structures :pr:`1299`
* Added cv_folds to ``ClassImbalanceDataCheck`` and added this check to ``DefaultDataChecks`` :pr:`1333`
* Make ``max_batches`` argument to ``AutoMLSearch.search`` public :pr:`1320`
* Included TextFeaturizer in options for automl search :pr:`1062`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! To make this easier for users to understand, could you say "Added text support to automl search"?

@@ -111,6 +113,10 @@ def _transform_parameters(self, pipeline_class, proposed_parameters):
component_parameters = proposed_parameters.get(component_class.name, {})
init_params = inspect.signature(component_class.__init__).parameters

# Add the text columns parameter if the component is a TextFeaturizer
if component_class.name == "Text Featurization Component":
component_parameters['text_columns'] = self._text_columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I'm hoping that we can remove this pathway in #1229 , when we add support for woodwork to pipelines and components, because at that point the text featurizer can access the woodwork datatable directly!

@@ -353,17 +353,14 @@ def _set_data_split(self, X):

self.data_split = self.data_split or default_data_split

def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_plot=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great call! This is totally dead. I genuinely don't know why we didn't delete it sooner 😆

Because its changing the method signature of search, just for completeness, let's add an entry to the breaking changes, like

Removed unused argument feature_types from AutoMLSearch.search

@@ -389,9 +386,14 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
X = pd.DataFrame(X)
X = ww.DataTable(X)

text_column_vals = X.select('natural_language')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I'll file that in woodwork. I don't think there should be a warning in this case; we can discuss on that ticket.

For this PR, you can add:

warnings.filterwarnings('ignore', 'The following selectors were not present in your DataTable')

pipeline = algo.next_batch()[0]
expected_params = {'text_columns': ['text_col_1', 'text_col_2']}
assert pipeline.parameters['Text Featurization Component'] == expected_params
assert isinstance(pipeline.component_graph[0], TextFeaturizer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, wouldn't pipeline.component_graph[0] be the imputer in this case, and you want pipeline.component_graph[1]? What am I missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pipeline in this test isn't dynamically generated, it's defined a few lines above as only having the TextFeaturizer and an estimator

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, thanks!

@eccabay
Copy link
Contributor Author

eccabay commented Oct 28, 2020

Performance testing doc is here

@dsherry
Copy link
Contributor

dsherry commented Oct 28, 2020

@eccabay thanks, I just reviewed the perf tests. Looks great!

@dsherry
Copy link
Contributor

dsherry commented Oct 28, 2020

@eccabay to fix the check_dependencies_updated_linux failure, I think you gotta update evalml/tests/latest_dependency_versions.txt on your branch--it looks like moving featuretools/nlp_primitives to core added distributed, and that's currently on our watch-list in the circleci config for the job. We could also take it off the watch list, but for now let's just add it to the deps versions, yeah?

@eccabay eccabay merged commit 83343aa into main Oct 28, 2020
@dsherry dsherry mentioned this pull request Oct 29, 2020
@eccabay eccabay deleted the 908_integrate_text branch November 2, 2020 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants