-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add __setitem__ method on EntitySet to overload add_dataframe method #1862
Add __setitem__ method on EntitySet to overload add_dataframe method #1862
Conversation
…thod-overloads-add_dataframe
…thod-overloads-add_dataframe
Codecov Report
@@ Coverage Diff @@
## main #1862 +/- ##
=======================================
Coverage 98.82% 98.82%
=======================================
Files 145 145
Lines 16268 16277 +9
=======================================
+ Hits 16077 16086 +9
Misses 191 191
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking we will want to update this guide in the documentation to discuss this approach to adding a dataframe to an EntitySet.
As part of this, we should mention the limitations of this approach, namely that if the dataframe does not have Woodwork initialized, the first column will be assumed to be the index column and all column types will be inferred by Woodwork. If users want to set the time index or have control over logical types, they should first initialize Woodwork on the dataframe with the proper parameters before assigning the dataframe to the EntitySet with this method.
@@ -8,6 +8,7 @@ | |||
import numpy as np | |||
import pandas as pd | |||
import pytest | |||
from mock import patch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mock is part of the standard library now
from unittest.mock import patch
…thod-overloads-add_dataframe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing major, just a couple things to clean up in release notes and a couple suggestions for the documentation.
docs/source/release_notes.rst
Outdated
@@ -35,6 +35,7 @@ v1.4.0 Jan 10, 2022 | |||
* Skip code coverage for specific dask usage lines (:pr:`1829`) | |||
* Increase minimum required numpy version to 1.21.0, scipy to 1.3.3, koalas to 1.8.1 (:pr:`1833`) | |||
* Remove pyyaml as a requirement (:pr:`1833`) | |||
* Adds `__setitem__` method to overload `add_dataframe` method on EntitySet (:pr:`1862`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to move to the Future Release section above. Also, in order to get code formatting to work correctly, you will need to use double back-tics around the methods.
docs/source/release_notes.rst
Outdated
@@ -45,7 +46,7 @@ v1.4.0 Jan 10, 2022 | |||
* Remove fastparquet as a test requirement (:pr:`1833`) | |||
|
|||
Thanks to the following people for contributing to this release: | |||
:user:`davesque`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd` | |||
:user:`davesque`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`, :user:`dvreed77` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to Future Release instead
"* The ``logical_types`` parameter indicates that \"product_id\" should be interpreted as a Categorical column, even though it is just an integer in the underlying data." | ||
"* The ``logical_types`` parameter indicates that \"product_id\" should be interpreted as a Categorical column, even though it is just an integer in the underlying data.\n", | ||
"\n", | ||
"You can also use a setter on the ``EntitySet`` object to add dataframes\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might flow better if we move this a little lower in the section, after we demonstrate how to use add_dataframe
.
Also, what do you think about putting this in a Note callout, like we do below with the description of plot
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to put this into a note callout, but they don't support code blocks or lists. I guess I could reword it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, didn't realize that. No big deal on the note formatting, but if we can try somehow separate it from the flow of the example, that might be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually lists work, but can't get codeblocks to work. I reformatted it
"\n", | ||
"* if the DataFrame does not have [Woodwork](https://woodwork.alteryx.com/) initialized, the first column will be the index column\n", | ||
"* if the DataFrame does not have Woodwork initialized, all columns will be inferred by Woodwork.\n", | ||
"* control over time index and logical types will need be initialized through Woodwork before adding." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested rewording:
if control over the time index column and logical types is needed, Woodwork should be initialized before adding the dataframe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but might be good to get someone else to review and approve before merging, just to make sure I'm not forgetting something that we wanted to do with this update.
"\n", | ||
".. note ::\n", | ||
"\n", | ||
" You can also use a setter on the ``EntitySet`` object to add dataframes\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repeat of the sentence above?
" that this will use the default implementation of `add_dataframe`, notably the following:\n", | ||
"\n", | ||
" * if the DataFrame does not have `Woodwork <https://woodwork.alteryx.com/>`_ initialized, the first column will be the index column\n", | ||
" * if the DataFrame does not have Woodwork initialized, all columns will be inferred by Woodwork.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wording nitpick - can this say
all columns' logical types will be inferred by Woodwork
@@ -671,6 +671,9 @@ def add_dataframe(self, | |||
self._add_references_to_metadata(dataframe) | |||
return self | |||
|
|||
def __setitem__(self, key, value): | |||
self.add_dataframe(dataframe=value, dataframe_name=key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be worth handling the situation where a user passes in a woodwork-initialized dataframe that doesn't have a df.ww.name
set (or if the df.ww.name
is different from key
) differently than add_dataframe
does?
Currently the first situation would end up with an error Cannot add a Woodwork DataFrame to EntitySet without a name
and the second raises a warning that the dataframe_name
parameter is going to be ignored in favor of the name at df.ww.name
.
I know there was some discussion with @thehomebrewnerd and @rwedge when we decided on that behavior for add_dataframe
in the first place, but somehow setitem
feels different to me. I know in woodwork's setitem, we just use the key value for a series' name even if we have to overwrite an already-existing name, and I think it might be work updating the woodwork name on the dataframe before calling add_dataframe
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tamargrey: When looking at the behavior of add_dataframe
when using a Woodwork initialized dataframe without a name, I do find it strange to get an error. I would think add_dataframe
, when given a name, would add that name to the Woodwork dataframe. Maybe that behavior should change? I feel pretty strongly that __setitem__
should be a very thin layer over add_dataframe
and call it with its default values, and let add_dataframe
handle errors, warnings, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, a partially initialized Woodwork Dataframe causes all the issues that are described in the Note text, so maybe that is enough in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the first case should be handled by having add_dataframe
add a name to woodwork-initialized dataframes that lack a name. There's a larger issue #1740 discussing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe @gsheni can chime in here, but I believe the underlying spirit of this ticket is to improve UX with Featuretools. So if I have an EntitySet and I just want to quickly add a dataframe (that hasn't been WW-initialized), I can just do es['dataframe'] = df
. If the user wants more granular control, they can do that by using add_dataframe
and additionally initializing the df using WW before adding it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice for this to also work with already woodworked dataframes but I'm fine with those edge cases being handled in other tickets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's go for a good initial pass and then we can improve on it later on.
Pull Request Description
__setitem__
method on EntitySet to overloadadd_dataframe
method