-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade to support Featuretools 1.0.0 and nlp-primitives 2.0.0 #2848
Conversation
a7ae78b
to
31c7d28
Compare
31c7d28
to
c500472
Compare
Codecov Report
@@ Coverage Diff @@
## main #2848 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 302 302
Lines 28394 28396 +2
=======================================
+ Hits 28301 28303 +2
Misses 93 93
Continue to review full report at Codecov.
|
@@ -12,8 +12,8 @@ psutil>=5.6.6 | |||
requirements-parser>=0.2.0 | |||
shap>=0.36.0 | |||
texttable>=1.6.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When featuretools and nlp-primitives release, remove the "rc1"s. Featuretools requires the update to ww 0.8.1. Bumped the minimum of the reqs to match the new minima for dask and pandas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do the ww 0.8.1 upgrade in a separate PR which we merge first? If this PR is ready to go I suppose it doesn't really matter at this point 😆
@@ -31,13 +31,14 @@ def __init__(self, index="index", random_seed=0, **kwargs): | |||
def _make_entity_set(self, X): | |||
"""Helper method that creates and returns the entity set given the input data.""" | |||
ft_es = EntitySet() | |||
del(X.ww) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need feedback here. The reason I did this was to try to accommodate X being a WW initialized but unnamed dataframe. If I delete the accessor, the add_dataframe()
function will follow its branch to re-initialize the dataframe and name it with the dataframe_name
param. If X remains unnamed but WW initialized, the add_dataframe
raises a series of Exceptions, the first being that it requires a named dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the del
is fine here as long as we document the reasoning for it (and we can't find a better solution). Alternatively, we could name and then remove the name as well which makes for a more involved solution but doesn't include a scary del
here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we file a featuretools issue to allow for dataframes with ww and without names?
Rather than deleting the accessor, can we just give the dataframe a name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @freddyaboulton here, it'd be nice if featuretools could support ww dataframes without names. I'm okay with not blocking this though, and filing an issue to track replacing this once we make the feature request and it gets merged :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue here is potentially a bit bigger than just the name. Currently if a ww dataframe is passed to EntitySet.add_dataframe
featuretools is expecting that everything is already set up and ready to go, meaning any values passed in for index
, time_index
or make_index
(and any other arguments) will be ignored.
That means even if you add a name to the dataframe before passing it to add_dataframe
, Featuretools will still not work as you expect because the index
and make_index
values will be ignored.
If you also need to set the index or use make index during the add call, we are going to have to change the add_dataframe
code in Featuretools to use those values.
Given that, I think the best bet for now is to stick with the del(X.ww)
approach...
@@ -50,19 +50,19 @@ def test_featuretools_index(mock_calculate_feature_matrix, mock_dfs, X_y_multi): | |||
feature = DFSTransformer() | |||
feature.fit(X_new_index) | |||
feature.transform(X_new_index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure on this change. @thehomebrewnerd do you have any feedback on this one? I wasn't sure whether entities was supposed to exist as an attribute of EntitySet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
entities
is gone as an attribute now and replaced with dataframes
@@ -144,10 +144,10 @@ def test_ft_woodwork_custom_overrides_returned_by_components(X_df): | |||
assert isinstance(transformed, pd.DataFrame) | |||
if logical_type == Datetime: | |||
assert {k: type(v) for k, v in transformed.ww.logical_types.items()} == { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change is a result of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, this makes sense given what these features mean!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the naming is kinda weird, like was the original feature name "0"? That must be it
bffc98d
to
57e5626
Compare
8dc4d47
to
17787a9
Compare
…aturetools 1.0.0 dev branch.
…ools dev branches.
…ew, higher min reqs.
5d88630
to
6f4f7cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good to me and I'm fine with the del
solution as long as we have proper documentation for it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple other minor comments in addition to those by @jeremyliweishih and @freddyaboulton. I'd like to think a bit more about the naming issue during EntitySet creation - deleting the accessor and reinitializing isn't ideal.
evalml/pipelines/components/transformers/preprocessing/text_featurizer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/transform_primitive_components.py
Outdated
Show resolved
Hide resolved
@@ -50,19 +50,19 @@ def test_featuretools_index(mock_calculate_feature_matrix, mock_dfs, X_y_multi): | |||
feature = DFSTransformer() | |||
feature.fit(X_new_index) | |||
feature.transform(X_new_index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
entities
is gone as an attribute now and replaced with dataframes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @chukarsten for doing this and thank you the @thehomebrewnerd for the thoughtful review!
Given @thehomebrewnerd 's comment about del X.ww
, I think this is good to go. The featuretools component is not currently used by default in AutoMLSearch (for now) so I think we have time to figure out if there's a better approach. The @thehomebrewnerd would you be ok with us filing a ft issue?
Of course! I think we can add the functionality you want, but might need to take a little time to think through the best way to do it... |
Ok @thehomebrewnerd @freddyaboulton , I'm going to make the minor changes and push. We'll move forward with this and I'll post the issue filed against Featuretools here. Thanks everyone for the great reviews and discussion. |
…sor deletion for later removal.
core-requirements.txt
Outdated
nlp-primitives>=1.1.0 | ||
woodwork==0.8.1 | ||
dask>=2021.2.0 | ||
nlp-primitives>=1.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm confused
- Old req here was "nlp-primitives>=1.1.0"
- New code lowers min to "nlp-primitives>=1.0.0"
- Changelog says "nlp-primitives 2.0.0"
Which version of nlp-primitives are we trying to install? If its 2.0.0, shouldn't we say "nlp-primitives>=2.0.0" here?
Otherwise we're telling users they can install nlp-primitives 1.0.0 with ft 1.0.0, and that will be buggy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! You are correct. I think I just got confused. I moved it to nlp-primitives>=2.0.0
@@ -31,14 +31,15 @@ def __init__(self, index="index", random_seed=0, **kwargs): | |||
def _make_entity_set(self, X): | |||
"""Helper method that creates and returns the entity set given the input data.""" | |||
ft_es = EntitySet() | |||
# TODO: This delete was introduced for compatibility with Featuretools 1.0.0. This should | |||
# be removed after Featuretools handles unnamed dataframes being passed to this function. | |||
del X.ww |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain why this is necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The discussion is here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsherry Somewhat short version of the discussion - currently Featuretools doesn't allow you to specify values for other arguments if you pass a a woodwork dataframe to EntitySet.add_dataframe
. So, if you need to do things like make_index
or set the name during the add_dataframe
call, you need to first clear out woodwork otherwise FT doesn't work as expected. This can be fixed in the future, but that's how it is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue filed here: alteryx/featuretools#1740
But long story short: adding a dataframe to an EntitySet requires either a WW init'd Dataframe with a name and an index, or just a dataframe that's not init'd. It doesn't handle a WW init'd Dataframe without a name. We submitted a request to handle what EvalML frequently sends to this function, a WW init'd df without a name, but until then, we can turn them into regular dataframes and let the infer_feature_types convert it back later.
evalml/tests/dependency_update_check/minimum_core_requirements.txt
Outdated
Show resolved
Hide resolved
featuretools==0.26.1 | ||
nlp-primitives==1.1.0 | ||
woodwork==0.8.1 | ||
dask==2021.2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsherry Dask switched their versioning approach a while back, so this is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Dask changed how they do their version numbering. 2.12.0 is pretty old. https://github.com/dask/dask/releases?after=2.14.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol got it thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chukarsten I found a few things which need updating before this is mergeable, feel free to dismiss my change request if I'm not around
Ok, I added this issue to featuretools so we can revisit a refactor here. |
Also filed a related EvalML issue #2910 so that we can backlog it and put it into our workflow. |
Addresses: #2816