Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entity.plot compute dask entitysets from delayed #1086

Merged
merged 10 commits into from Aug 4, 2020

Conversation

systemshift
Copy link
Contributor

@systemshift systemshift commented Jul 26, 2020

Pull Request Description

This is a first draft that should fix #1051
I will work on the failing test unit if the changes are approved.

My concern is that while this change 'solves' dask entity sets being plotted, it only works on low memory use cases. And would break when a user has large entityset and keeps calling .compute() inside a loop. My assumption is that a dask user would otherwise use pandas if they did not need to work on extremely large data.

Thoughts @rwedge?

update: might be another task to add to #901


After creating the pull request: in order to pass the changelog_updated check you will need to update the "Future Release" section of docs/source/changelog.rst to include this pull request.

Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting up this PR, @systemshift

It would be good to avoid the compute. Perhaps es.plot should not display the row counts for a dask entityset. That way there would be no need for a compute.

@@ -23,6 +24,7 @@ Changelog
* Implement automated process for checking critical dependencies (:pr:`1045`, :pr:`1054`, :pr:`1081`)
* Don't run changelog check for release PRs or automated dependency PRs (:pr:`1057`)
* Fix non-deterministic behavior in Dask test causing codecov issues (:pr:`1070`)
* Remove xfail cases from ``test_plotting.py`` (:pr:`1086`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main change is fixing EntitySet.plot, an additional changelog entry is unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an additional changelog entry is unnecessary

Not sure what you mean by this, should I remove the changes I made to testing section?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by this, should I remove the changes I made to testing section?

Sorry, I meant remove this line from the changelog, since we already have the line added in the Fixes section.

@systemshift
Copy link
Contributor Author

Perhaps es.plot should not display the row counts for a dask entityset.

Wouldn't this mean a dask user will lose some function features?

@rwedge
Copy link
Contributor

rwedge commented Jul 27, 2020

Perhaps es.plot should not display the row counts for a dask entityset.

Wouldn't this mean a dask user will lose some function features?

It would, but this allows a dask user to generate the plot without needing a .compute(). There's currently a similar compromise when printing an EntiySet object:

Entityset: ecommerce
  Entities:
    régions [Rows: Delayed('int-aadf06bc-0566-49f6-b2ff-7dc4165c4108'), Columns: 2]
    stores [Rows: Delayed('int-3b836190-bb8b-4e7b-a9ca-265f8a24ab08'), Columns: 3]
    products [Rows: Delayed('int-ccb84bf7-0d44-4561-9961-e4928895b152'), Columns: 4]
    customers [Rows: Delayed('int-31180abd-1530-4ff7-9921-8acb542985e8'), Columns: 15]
    sessions [Rows: Delayed('int-e6c393be-30c7-4faa-abe3-7044f8b55122'), Columns: 6]
    log [Rows: Delayed('int-e186025b-a5f4-45cc-a3e4-fb1a6ffa6457'), Columns: 15]
    cohorts [Rows: Delayed('int-fecffa69-0a66-4884-ba98-829d7c5fce12'), Columns: 3]
  Relationships:
    customers.cohort -> cohorts.cohort
    customers.région_id -> régions.id
    stores.région_id -> régions.id
    sessions.customer_id -> customers.id
    log.session_id -> sessions.id
    log.product_id -> products.id

I think the shorter time creating the plot due to not needing to compute is worth the loss of information

if isinstance(entity.df, dd.DataFrame): # entity is a dask entity
label = '{%s |%s\l}' % (entity.id, variables_string) # noqa: W605
else:
nrows = entity.shape[0]
label = '{%s (%d row%s)|%s\l}' % (entity.id, nrows, 's' * (nrows > 1), variables_string) # noqa: W605
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rwedge if this looks good I will start working on the broken test units to finish it up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@codecov
Copy link

codecov bot commented Aug 2, 2020

Codecov Report

Merging #1086 into main will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1086      +/-   ##
==========================================
- Coverage   98.36%   98.36%   -0.01%     
==========================================
  Files         126      126              
  Lines       13272    13257      -15     
==========================================
- Hits        13055    13040      -15     
  Misses        217      217              
Impacted Files Coverage Δ
featuretools/entityset/entityset.py 97.66% <100.00%> (+0.01%) ⬆️
...eaturetools/tests/entityset_tests/test_plotting.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b5f7ccf...e483bf0. Read the comment docs.

@systemshift systemshift requested a review from rwedge August 2, 2020 12:46
docs/source/changelog.rst Outdated Show resolved Hide resolved
@systemshift systemshift requested a review from rwedge August 3, 2020 23:25
@rwedge rwedge added this to the August Milestone milestone Aug 4, 2020
Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@rwedge rwedge merged commit c805937 into alteryx:main Aug 4, 2020
2 of 3 checks passed
@thehomebrewnerd thehomebrewnerd mentioned this pull request Aug 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update EntitySet.plot() to work with Dask entitysets
2 participants