Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add __setitem__ method on EntitySet to overload add_dataframe method #1862

Merged
merged 12 commits into from
Jan 27, 2022

Conversation

dvreed77
Copy link
Contributor

Pull Request Description

@codecov
Copy link

codecov bot commented Jan 24, 2022

Codecov Report

Merging #1862 (ffbe8fd) into main (dfdd2b4) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1862   +/-   ##
=======================================
  Coverage   98.82%   98.82%           
=======================================
  Files         145      145           
  Lines       16268    16277    +9     
=======================================
+ Hits        16077    16086    +9     
  Misses        191      191           
Impacted Files Coverage Δ
featuretools/entityset/entityset.py 99.21% <100.00%> (+<0.01%) ⬆️
featuretools/tests/entityset_tests/test_es.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dfdd2b4...ffbe8fd. Read the comment docs.

@dvreed77 dvreed77 changed the title WIP: __setitem__ method overloads add_dataframe on EntitySet __setitem__ method overloads add_dataframe on EntitySet Jan 24, 2022
@gsheni gsheni requested review from rwedge and removed request for gsheni January 24, 2022 20:34
Copy link
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we will want to update this guide in the documentation to discuss this approach to adding a dataframe to an EntitySet.

As part of this, we should mention the limitations of this approach, namely that if the dataframe does not have Woodwork initialized, the first column will be assumed to be the index column and all column types will be inferred by Woodwork. If users want to set the time index or have control over logical types, they should first initialize Woodwork on the dataframe with the proper parameters before assigning the dataframe to the EntitySet with this method.

@@ -8,6 +8,7 @@
import numpy as np
import pandas as pd
import pytest
from mock import patch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mock is part of the standard library now

from unittest.mock import patch

@gsheni gsheni changed the title __setitem__ method overloads add_dataframe on EntitySet Add __setitem__ method on EntitySet to overload add_dataframe method Jan 25, 2022
Copy link
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing major, just a couple things to clean up in release notes and a couple suggestions for the documentation.

@@ -35,6 +35,7 @@ v1.4.0 Jan 10, 2022
* Skip code coverage for specific dask usage lines (:pr:`1829`)
* Increase minimum required numpy version to 1.21.0, scipy to 1.3.3, koalas to 1.8.1 (:pr:`1833`)
* Remove pyyaml as a requirement (:pr:`1833`)
* Adds `__setitem__` method to overload `add_dataframe` method on EntitySet (:pr:`1862`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to move to the Future Release section above. Also, in order to get code formatting to work correctly, you will need to use double back-tics around the methods.

@@ -45,7 +46,7 @@ v1.4.0 Jan 10, 2022
* Remove fastparquet as a test requirement (:pr:`1833`)

Thanks to the following people for contributing to this release:
:user:`davesque`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`
:user:`davesque`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`, :user:`dvreed77`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to Future Release instead

"* The ``logical_types`` parameter indicates that \"product_id\" should be interpreted as a Categorical column, even though it is just an integer in the underlying data."
"* The ``logical_types`` parameter indicates that \"product_id\" should be interpreted as a Categorical column, even though it is just an integer in the underlying data.\n",
"\n",
"You can also use a setter on the ``EntitySet`` object to add dataframes\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might flow better if we move this a little lower in the section, after we demonstrate how to use add_dataframe.

Also, what do you think about putting this in a Note callout, like we do below with the description of plot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to put this into a note callout, but they don't support code blocks or lists. I guess I could reword it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, didn't realize that. No big deal on the note formatting, but if we can try somehow separate it from the flow of the example, that might be good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually lists work, but can't get codeblocks to work. I reformatted it

"\n",
"* if the DataFrame does not have [Woodwork](https://woodwork.alteryx.com/) initialized, the first column will be the index column\n",
"* if the DataFrame does not have Woodwork initialized, all columns will be inferred by Woodwork.\n",
"* control over time index and logical types will need be initialized through Woodwork before adding."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested rewording:
if control over the time index column and logical types is needed, Woodwork should be initialized before adding the dataframe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@thehomebrewnerd thehomebrewnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but might be good to get someone else to review and approve before merging, just to make sure I'm not forgetting something that we wanted to do with this update.

"\n",
".. note ::\n",
"\n",
" You can also use a setter on the ``EntitySet`` object to add dataframes\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeat of the sentence above?

" that this will use the default implementation of `add_dataframe`, notably the following:\n",
"\n",
" * if the DataFrame does not have `Woodwork <https://woodwork.alteryx.com/>`_ initialized, the first column will be the index column\n",
" * if the DataFrame does not have Woodwork initialized, all columns will be inferred by Woodwork.\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wording nitpick - can this say

all columns' logical types will be inferred by Woodwork

@@ -671,6 +671,9 @@ def add_dataframe(self,
self._add_references_to_metadata(dataframe)
return self

def __setitem__(self, key, value):
self.add_dataframe(dataframe=value, dataframe_name=key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth handling the situation where a user passes in a woodwork-initialized dataframe that doesn't have a df.ww.name set (or if the df.ww.name is different from key) differently than add_dataframe does?

Currently the first situation would end up with an error Cannot add a Woodwork DataFrame to EntitySet without a name and the second raises a warning that the dataframe_name parameter is going to be ignored in favor of the name at df.ww.name.

I know there was some discussion with @thehomebrewnerd and @rwedge when we decided on that behavior for add_dataframe in the first place, but somehow setitem feels different to me. I know in woodwork's setitem, we just use the key value for a series' name even if we have to overwrite an already-existing name, and I think it might be work updating the woodwork name on the dataframe before calling add_dataframe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tamargrey: When looking at the behavior of add_dataframe when using a Woodwork initialized dataframe without a name, I do find it strange to get an error. I would think add_dataframe, when given a name, would add that name to the Woodwork dataframe. Maybe that behavior should change? I feel pretty strongly that __setitem__ should be a very thin layer over add_dataframe and call it with its default values, and let add_dataframe handle errors, warnings, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, a partially initialized Woodwork Dataframe causes all the issues that are described in the Note text, so maybe that is enough in this case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the first case should be handled by having add_dataframe add a name to woodwork-initialized dataframes that lack a name. There's a larger issue #1740 discussing it

Copy link
Contributor Author

@dvreed77 dvreed77 Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe @gsheni can chime in here, but I believe the underlying spirit of this ticket is to improve UX with Featuretools. So if I have an EntitySet and I just want to quickly add a dataframe (that hasn't been WW-initialized), I can just do es['dataframe'] = df. If the user wants more granular control, they can do that by using add_dataframe and additionally initializing the df using WW before adding it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice for this to also work with already woodworked dataframes but I'm fine with those edge cases being handled in other tickets

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's go for a good initial pass and then we can improve on it later on.

@dvreed77 dvreed77 merged commit 392c198 into main Jan 27, 2022
@dvreed77 dvreed77 deleted the issue-1685-setitem-method-overloads-add_dataframe branch January 27, 2022 14:24
@dvreed77 dvreed77 mentioned this pull request Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add dataframe with entityset setter
5 participants