Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to support Featuretools 1.0.0 and nlp-primitives 2.0.0 #2848

Merged
merged 24 commits into from
Oct 14, 2021

Conversation

chukarsten
Copy link
Contributor

Addresses: #2816

@codecov
Copy link

codecov bot commented Sep 29, 2021

Codecov Report

Merging #2848 (988eaaf) into main (3948c5d) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2848     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        302     302             
  Lines      28394   28396      +2     
=======================================
+ Hits       28301   28303      +2     
  Misses        93      93             
Impacted Files Coverage Δ
...ponents/transformers/preprocessing/featuretools.py 100.0% <100.0%> (ø)
...ents/transformers/preprocessing/text_featurizer.py 100.0% <100.0%> (ø)
...rs/preprocessing/transform_primitive_components.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_featuretools.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3948c5d...988eaaf. Read the comment docs.

@@ -12,8 +12,8 @@ psutil>=5.6.6
requirements-parser>=0.2.0
shap>=0.36.0
texttable>=1.6.2
Copy link
Contributor Author

@chukarsten chukarsten Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When featuretools and nlp-primitives release, remove the "rc1"s. Featuretools requires the update to ww 0.8.1. Bumped the minimum of the reqs to match the new minima for dask and pandas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do the ww 0.8.1 upgrade in a separate PR which we merge first? If this PR is ready to go I suppose it doesn't really matter at this point 😆

@@ -31,13 +31,14 @@ def __init__(self, index="index", random_seed=0, **kwargs):
def _make_entity_set(self, X):
"""Helper method that creates and returns the entity set given the input data."""
ft_es = EntitySet()
del(X.ww)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need feedback here. The reason I did this was to try to accommodate X being a WW initialized but unnamed dataframe. If I delete the accessor, the add_dataframe() function will follow its branch to re-initialize the dataframe and name it with the dataframe_name param. If X remains unnamed but WW initialized, the add_dataframe raises a series of Exceptions, the first being that it requires a named dataframe.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the del is fine here as long as we document the reasoning for it (and we can't find a better solution). Alternatively, we could name and then remove the name as well which makes for a more involved solution but doesn't include a scary del here!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we file a featuretools issue to allow for dataframes with ww and without names?

Rather than deleting the accessor, can we just give the dataframe a name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @freddyaboulton here, it'd be nice if featuretools could support ww dataframes without names. I'm okay with not blocking this though, and filing an issue to track replacing this once we make the feature request and it gets merged :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue here is potentially a bit bigger than just the name. Currently if a ww dataframe is passed to EntitySet.add_dataframe featuretools is expecting that everything is already set up and ready to go, meaning any values passed in for index, time_index or make_index (and any other arguments) will be ignored.

That means even if you add a name to the dataframe before passing it to add_dataframe, Featuretools will still not work as you expect because the index and make_index values will be ignored.

If you also need to set the index or use make index during the add call, we are going to have to change the add_dataframe code in Featuretools to use those values.

Given that, I think the best bet for now is to stick with the del(X.ww) approach...

@@ -50,19 +50,19 @@ def test_featuretools_index(mock_calculate_feature_matrix, mock_dfs, X_y_multi):
feature = DFSTransformer()
feature.fit(X_new_index)
feature.transform(X_new_index)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure on this change. @thehomebrewnerd do you have any feedback on this one? I wasn't sure whether entities was supposed to exist as an attribute of EntitySet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

entities is gone as an attribute now and replaced with dataframes

@@ -144,10 +144,10 @@ def test_ft_woodwork_custom_overrides_returned_by_components(X_df):
assert isinstance(transformed, pd.DataFrame)
if logical_type == Datetime:
assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is a result of this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, this makes sense given what these features mean!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the naming is kinda weird, like was the original feature name "0"? That must be it

@chukarsten chukarsten force-pushed the feature_tools_upgrade branch 2 times, most recently from bffc98d to 57e5626 Compare October 1, 2021 19:03
Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me and I'm fine with the del solution as long as we have proper documentation for it!

Copy link
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple other minor comments in addition to those by @jeremyliweishih and @freddyaboulton. I'd like to think a bit more about the naming issue during EntitySet creation - deleting the accessor and reinitializing isn't ideal.

core-requirements.txt Show resolved Hide resolved
@@ -50,19 +50,19 @@ def test_featuretools_index(mock_calculate_feature_matrix, mock_dfs, X_y_multi):
feature = DFSTransformer()
feature.fit(X_new_index)
feature.transform(X_new_index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

entities is gone as an attribute now and replaced with dataframes

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @chukarsten for doing this and thank you the @thehomebrewnerd for the thoughtful review!

Given @thehomebrewnerd 's comment about del X.ww, I think this is good to go. The featuretools component is not currently used by default in AutoMLSearch (for now) so I think we have time to figure out if there's a better approach. The @thehomebrewnerd would you be ok with us filing a ft issue?

@thehomebrewnerd
Copy link
Contributor

@thehomebrewnerd would you be ok with us filing a ft issue?

Of course! I think we can add the functionality you want, but might need to take a little time to think through the best way to do it...

@chukarsten
Copy link
Contributor Author

Ok @thehomebrewnerd @freddyaboulton , I'm going to make the minor changes and push. We'll move forward with this and I'll post the issue filed against Featuretools here. Thanks everyone for the great reviews and discussion.

@chukarsten chukarsten changed the title Upgrade to support Featuretools 1.0.0rc Upgrade to support Featuretools 1.0.0 and nlp-primitives 2.0.0 Oct 14, 2021
nlp-primitives>=1.1.0
woodwork==0.8.1
dask>=2021.2.0
nlp-primitives>=1.0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm confused

  • Old req here was "nlp-primitives>=1.1.0"
  • New code lowers min to "nlp-primitives>=1.0.0"
  • Changelog says "nlp-primitives 2.0.0"

Which version of nlp-primitives are we trying to install? If its 2.0.0, shouldn't we say "nlp-primitives>=2.0.0" here?

Otherwise we're telling users they can install nlp-primitives 1.0.0 with ft 1.0.0, and that will be buggy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! You are correct. I think I just got confused. I moved it to nlp-primitives>=2.0.0

@@ -31,14 +31,15 @@ def __init__(self, index="index", random_seed=0, **kwargs):
def _make_entity_set(self, X):
"""Helper method that creates and returns the entity set given the input data."""
ft_es = EntitySet()
# TODO: This delete was introduced for compatibility with Featuretools 1.0.0. This should
# be removed after Featuretools handles unnamed dataframes being passed to this function.
del X.ww
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why this is necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion is here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry Somewhat short version of the discussion - currently Featuretools doesn't allow you to specify values for other arguments if you pass a a woodwork dataframe to EntitySet.add_dataframe. So, if you need to do things like make_index or set the name during the add_dataframe call, you need to first clear out woodwork otherwise FT doesn't work as expected. This can be fixed in the future, but that's how it is for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue filed here: alteryx/featuretools#1740

But long story short: adding a dataframe to an EntitySet requires either a WW init'd Dataframe with a name and an index, or just a dataframe that's not init'd. It doesn't handle a WW init'd Dataframe without a name. We submitted a request to handle what EvalML frequently sends to this function, a WW init'd df without a name, but until then, we can turn them into regular dataframes and let the infer_feature_types convert it back later.

featuretools==0.26.1
nlp-primitives==1.1.0
woodwork==0.8.1
dask==2021.2.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a typo?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsherry Dask switched their versioning approach a while back, so this is valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Dask changed how they do their version numbering. 2.12.0 is pretty old. https://github.com/dask/dask/releases?after=2.14.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol got it thanks

setup.py Outdated Show resolved Hide resolved
Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chukarsten I found a few things which need updating before this is mergeable, feel free to dismiss my change request if I'm not around

@chukarsten
Copy link
Contributor Author

Ok, I added this issue to featuretools so we can revisit a refactor here.

@chukarsten
Copy link
Contributor Author

Also filed a related EvalML issue #2910 so that we can backlog it and put it into our workflow.

@chukarsten chukarsten dismissed dsherry’s stale review October 14, 2021 19:02

Changes were addressed!

@chukarsten chukarsten merged commit 5c85614 into main Oct 14, 2021
@chukarsten chukarsten mentioned this pull request Oct 14, 2021
@chukarsten chukarsten deleted the feature_tools_upgrade branch October 14, 2021 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants