Upgrade to support Featuretools 1.0.0 and nlp-primitives 2.0.0#2848
Upgrade to support Featuretools 1.0.0 and nlp-primitives 2.0.0#2848chukarsten merged 24 commits intomainfrom
Conversation
a7ae78b to
31c7d28
Compare
31c7d28 to
c500472
Compare
Codecov Report
@@ Coverage Diff @@
## main #2848 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 302 302
Lines 28394 28396 +2
=======================================
+ Hits 28301 28303 +2
Misses 93 93
Continue to review full report at Codecov.
|
| @@ -12,8 +12,8 @@ psutil>=5.6.6 | |||
| requirements-parser>=0.2.0 | |||
| shap>=0.36.0 | |||
| texttable>=1.6.2 | |||
There was a problem hiding this comment.
When featuretools and nlp-primitives release, remove the "rc1"s. Featuretools requires the update to ww 0.8.1. Bumped the minimum of the reqs to match the new minima for dask and pandas.
There was a problem hiding this comment.
Should we do the ww 0.8.1 upgrade in a separate PR which we merge first? If this PR is ready to go I suppose it doesn't really matter at this point 😆
| def _make_entity_set(self, X): | ||
| """Helper method that creates and returns the entity set given the input data.""" | ||
| ft_es = EntitySet() | ||
| del(X.ww) |
There was a problem hiding this comment.
I need feedback here. The reason I did this was to try to accommodate X being a WW initialized but unnamed dataframe. If I delete the accessor, the add_dataframe() function will follow its branch to re-initialize the dataframe and name it with the dataframe_name param. If X remains unnamed but WW initialized, the add_dataframe raises a series of Exceptions, the first being that it requires a named dataframe.
There was a problem hiding this comment.
I think the del is fine here as long as we document the reasoning for it (and we can't find a better solution). Alternatively, we could name and then remove the name as well which makes for a more involved solution but doesn't include a scary del here!
There was a problem hiding this comment.
Should we file a featuretools issue to allow for dataframes with ww and without names?
Rather than deleting the accessor, can we just give the dataframe a name?
There was a problem hiding this comment.
I agree with @freddyaboulton here, it'd be nice if featuretools could support ww dataframes without names. I'm okay with not blocking this though, and filing an issue to track replacing this once we make the feature request and it gets merged :)
There was a problem hiding this comment.
I think the issue here is potentially a bit bigger than just the name. Currently if a ww dataframe is passed to EntitySet.add_dataframe featuretools is expecting that everything is already set up and ready to go, meaning any values passed in for index, time_index or make_index (and any other arguments) will be ignored.
That means even if you add a name to the dataframe before passing it to add_dataframe, Featuretools will still not work as you expect because the index and make_index values will be ignored.
If you also need to set the index or use make index during the add call, we are going to have to change the add_dataframe code in Featuretools to use those values.
Given that, I think the best bet for now is to stick with the del(X.ww) approach...
| @@ -50,19 +50,19 @@ def test_featuretools_index(mock_calculate_feature_matrix, mock_dfs, X_y_multi): | |||
| feature = DFSTransformer() | |||
| feature.fit(X_new_index) | |||
| feature.transform(X_new_index) | |||
There was a problem hiding this comment.
I'm not 100% sure on this change. @thehomebrewnerd do you have any feedback on this one? I wasn't sure whether entities was supposed to exist as an attribute of EntitySet.
There was a problem hiding this comment.
entities is gone as an attribute now and replaced with dataframes
| @@ -144,10 +144,10 @@ def test_ft_woodwork_custom_overrides_returned_by_components(X_df): | |||
| assert isinstance(transformed, pd.DataFrame) | |||
| if logical_type == Datetime: | |||
| assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == { | |||
There was a problem hiding this comment.
I think this change is a result of this.
There was a problem hiding this comment.
Cool, this makes sense given what these features mean!
There was a problem hiding this comment.
Although the naming is kinda weird, like was the original feature name "0"? That must be it
bffc98d to
57e5626
Compare
8dc4d47 to
17787a9
Compare
…aturetools 1.0.0 dev branch.
…ools dev branches.
…ew, higher min reqs.
5d88630 to
6f4f7cf
Compare
jeremyliweishih
left a comment
There was a problem hiding this comment.
this looks good to me and I'm fine with the del solution as long as we have proper documentation for it!
thehomebrewnerd
left a comment
There was a problem hiding this comment.
Just a couple other minor comments in addition to those by @jeremyliweishih and @freddyaboulton. I'd like to think a bit more about the naming issue during EntitySet creation - deleting the accessor and reinitializing isn't ideal.
evalml/pipelines/components/transformers/preprocessing/text_featurizer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/transform_primitive_components.py
Outdated
Show resolved
Hide resolved
| @@ -50,19 +50,19 @@ def test_featuretools_index(mock_calculate_feature_matrix, mock_dfs, X_y_multi): | |||
| feature = DFSTransformer() | |||
| feature.fit(X_new_index) | |||
| feature.transform(X_new_index) | |||
There was a problem hiding this comment.
entities is gone as an attribute now and replaced with dataframes
freddyaboulton
left a comment
There was a problem hiding this comment.
Thank you @chukarsten for doing this and thank you the @thehomebrewnerd for the thoughtful review!
Given @thehomebrewnerd 's comment about del X.ww, I think this is good to go. The featuretools component is not currently used by default in AutoMLSearch (for now) so I think we have time to figure out if there's a better approach. The @thehomebrewnerd would you be ok with us filing a ft issue?
Of course! I think we can add the functionality you want, but might need to take a little time to think through the best way to do it... |
|
Ok @thehomebrewnerd @freddyaboulton , I'm going to make the minor changes and push. We'll move forward with this and I'll post the issue filed against Featuretools here. Thanks everyone for the great reviews and discussion. |
…sor deletion for later removal.
core-requirements.txt
Outdated
| nlp-primitives>=1.1.0 | ||
| woodwork==0.8.1 | ||
| dask>=2021.2.0 | ||
| nlp-primitives>=1.0.0 |
There was a problem hiding this comment.
Hmm, I'm confused
- Old req here was "nlp-primitives>=1.1.0"
- New code lowers min to "nlp-primitives>=1.0.0"
- Changelog says "nlp-primitives 2.0.0"
Which version of nlp-primitives are we trying to install? If its 2.0.0, shouldn't we say "nlp-primitives>=2.0.0" here?
Otherwise we're telling users they can install nlp-primitives 1.0.0 with ft 1.0.0, and that will be buggy
There was a problem hiding this comment.
Yep! You are correct. I think I just got confused. I moved it to nlp-primitives>=2.0.0
| ft_es = EntitySet() | ||
| # TODO: This delete was introduced for compatibility with Featuretools 1.0.0. This should | ||
| # be removed after Featuretools handles unnamed dataframes being passed to this function. | ||
| del X.ww |
There was a problem hiding this comment.
Could you please explain why this is necessary?
There was a problem hiding this comment.
@dsherry Somewhat short version of the discussion - currently Featuretools doesn't allow you to specify values for other arguments if you pass a a woodwork dataframe to EntitySet.add_dataframe. So, if you need to do things like make_index or set the name during the add_dataframe call, you need to first clear out woodwork otherwise FT doesn't work as expected. This can be fixed in the future, but that's how it is for now.
There was a problem hiding this comment.
Issue filed here: alteryx/featuretools#1740
But long story short: adding a dataframe to an EntitySet requires either a WW init'd Dataframe with a name and an index, or just a dataframe that's not init'd. It doesn't handle a WW init'd Dataframe without a name. We submitted a request to handle what EvalML frequently sends to this function, a WW init'd df without a name, but until then, we can turn them into regular dataframes and let the infer_feature_types convert it back later.
evalml/tests/dependency_update_check/minimum_core_requirements.txt
Outdated
Show resolved
Hide resolved
| featuretools==0.26.1 | ||
| nlp-primitives==1.1.0 | ||
| woodwork==0.8.1 | ||
| dask==2021.2.0 |
There was a problem hiding this comment.
@dsherry Dask switched their versioning approach a while back, so this is valid.
There was a problem hiding this comment.
I don't think so. Dask changed how they do their version numbering. 2.12.0 is pretty old. https://github.com/dask/dask/releases?after=2.14.0
dsherry
left a comment
There was a problem hiding this comment.
@chukarsten I found a few things which need updating before this is mergeable, feel free to dismiss my change request if I'm not around
|
Ok, I added this issue to featuretools so we can revisit a refactor here. |
|
Also filed a related EvalML issue #2910 so that we can backlog it and put it into our workflow. |
Addresses: #2816