
Allow dask dataframe during entity creation #783

Merged: 116 commits into master, Jun 4, 2020
Conversation

@thehomebrewnerd (Contributor) commented Oct 24, 2019

Allows users to supply either a pandas dataframe or a dask dataframe when creating an entity with es.entity_from_dataframe.

Summary of Changes

  • Updated df[col].tolist() syntax to list(df[col]), since dask dataframes do not support .tolist()
  • Changed frame.shape[0] syntax to len(frame), since accessing .shape on a dask dataframe returns a delayed object for the row count; len(frame) must be used to get the total number of rows.
  • Changed if df.empty syntax to if len(df) == 0, since dask dataframes do not have an .empty attribute.
  • Updated syntax because dask dataframes do not support the inplace parameter when dropping variables or renaming columns.
  • To determine whether a dask dataframe index has unique values, .compute() must be called before checking .is_unique, so logic was added to perform this step for dask dataframes.
  • A list or np.array cannot be assigned to a dask dataframe column as it can with pandas, so a dask dataframe must first be created from a dask array before performing the assignment.
  • Dask dataframes do not support selecting rows with .iloc, so to get the first row of data you must first call .head() or .compute() on the dataframe. .head() is preferable when only the first row is needed, since .compute() would compute the full dataframe.
  • Calling .set_index() on a dask dataframe raises an error if the index column passed is of type categorical. This can be fixed by first calling .cat.as_ordered() on the index column or by casting it to type object with .astype(object). In a simple test the performance difference was slight, with .astype(object) about 4% faster.
  • Updated the sampling process in utils/entity_utils.py, since dask dataframes do not support sampling by a specified number of samples.
  • Many more...
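The syntax swaps in the first few bullets can be sketched with plain pandas, since the point of each change is that the replacement form works on both backends (the dataframe contents here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})

# list(df[col]) works on both backends; dask columns have no .tolist()
values = list(df["id"])

# len(frame) gives the row count on both backends; dask's .shape is lazy
n_rows = len(df)

# len(df) == 0 replaces df.empty, which dask dataframes lack
is_empty = len(df) == 0

# reassignment replaces inplace=True, which dask does not support
df = df.drop(columns=["val"])
df = df.rename(columns={"id": "entity_id"})
```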
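For the .is_unique change, a minimal sketch of the kind of branching logic described (the helper name is hypothetical, not the PR's actual code, and the dask branch is guarded by a module check so the snippet runs with pandas alone):

```python
import pandas as pd

def index_is_unique(frame):
    """Return True if the frame's index contains only unique values.

    On a dask dataframe the index must be computed before .is_unique
    gives a concrete answer; pandas can answer directly.
    """
    if type(frame).__module__.startswith("dask"):
        # dask path: materialize the index first, then check uniqueness
        return frame.index.compute().is_unique
    return frame.index.is_unique

unique_df = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
dup_df = pd.DataFrame({"x": [1, 2]}, index=[0, 0])
```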
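The row-access, set_index, and sampling changes follow the same pattern of replacing a pandas-only idiom with one both backends accept; a rough pandas-only illustration (sizes and column names invented):

```python
import pandas as pd

df = pd.DataFrame({"id": range(10), "key": list("abcdefghij")})

# .head(1) fetches the first row on both backends; dask has no .iloc rows
first_row = df.head(1)

# casting a categorical column to object sidesteps dask's set_index error
df["key"] = df["key"].astype("category")
indexed = df.set_index(df["key"].astype(object))

# dask only samples by fraction, so convert a sample count to a frac
n_samples = 3
sampled = df.sample(frac=n_samples / len(df), random_state=0)
```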

@thehomebrewnerd changed the title from "Allow dask dataframe during entity creation" to "[WIP] Allow dask dataframe during entity creation" Nov 6, 2019
@codecov-io commented Jan 24, 2020

Codecov Report

Merging #783 into master will decrease coverage by 0.00%.
The diff coverage is 98.64%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #783      +/-   ##
==========================================
- Coverage   98.27%   98.27%   -0.01%     
==========================================
  Files         119      121       +2     
  Lines       11078    12338    +1260     
==========================================
+ Hits        10887    12125    +1238     
- Misses        191      213      +22     
Impacted Files Coverage Δ
...mputational_backend/test_feature_set_calculator.py 97.96% <92.40%> (-2.04%) ⬇️
...utational_backend/test_calculate_feature_matrix.py 98.13% <95.19%> (-1.30%) ⬇️
...s/computational_backends/feature_set_calculator.py 98.69% <98.57%> (+0.14%) ⬆️
...ools/tests/entityset_tests/test_last_time_index.py 99.53% <98.59%> (-0.47%) ⬇️
featuretools/tests/entityset_tests/test_es.py 99.84% <99.49%> (-0.16%) ⬇️
...computational_backends/calculate_feature_matrix.py 98.71% <100.00%> (+0.10%) ⬆️
featuretools/entityset/deserialize.py 100.00% <100.00%> (ø)
featuretools/entityset/entity.py 96.16% <100.00%> (+0.22%) ⬆️
featuretools/entityset/entityset.py 97.22% <100.00%> (+0.43%) ⬆️
featuretools/entityset/serialize.py 100.00% <100.00%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f8adb6...1f4fe5a. Read the comment docs.

frances-h and others added 2 commits June 2, 2020 15:56
docs/source/changelog.rst (outdated, resolved)
@kmax12 self-requested a review June 3, 2020 18:50

@rwedge (Contributor) commented Jun 3, 2020

@kmax12 (Contributor) left a comment


here are some comments on the docs. overall looking good and well done with documenting all the details. I wanted to get these comments over ASAP to get the release out, so let me know if anything is confusing and I can clarify or talk through live.

docs/source/guides/dfs_with_dask_entitysets.rst (5 outdated, resolved threads)
@@ -62,3 +62,11 @@ The dashboard requires an additional python package, bokeh, to work. Once bokeh
Parallel Computation by Partitioning Data
-----------------------------------------
As an alternative to Featuretools' parallelization, the data can be partitioned and the feature calculations run on multiple cores or a cluster using Dask or Apache Spark with PySpark. This approach may be necessary with a large ``EntitySet``, because the current parallel implementation sends the entire ``EntitySet`` to each worker, which may exhaust the worker memory. For more information on partitioning the data and using Dask or Spark, see :doc:`/guides/performance`. Dask and Spark allow Featuretools to scale to multiple cores on a single machine or to multiple machines in a cluster.
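As a rough illustration of the partitioning idea described above (a sketch only: the data, the build_features stand-in, and the use of a thread pool are all invented here, not Featuretools API):

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

data = pd.DataFrame({"customer_id": [1, 1, 2, 2, 3],
                     "amount": [10, 20, 5, 15, 30]})

def build_features(partition):
    # stand-in for running DFS on an EntitySet built from one partition
    return partition["amount"].sum()

# one partition per customer, processed independently in parallel
partitions = [group for _, group in data.groupby("customer_id")]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(build_features, partitions))
```

Because each partition holds all rows for one customer, the per-partition results can simply be concatenated afterward.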

Computation with a Dask EntitySet (BETA)
Contributor

I'd move this section up to the top, and we also need to edit the "Running Featuretools with Spark and Dask" section to align with this.

Contributor Author

Want to make sure I fully understand this comment...

So, you want this new Dask section to be the very first section in the document instead of the last?

Can you provide a little more info on what you are thinking in regards to aligning the "Running Featuretools with Spark and Dask" section with this?

@kmax12 (Contributor) Jun 4, 2020

So, you want this new Dask section to be the very first section in the document instead of the last?

Yep

Can you provide a little more info on what you are thinking in regards to aligning the "Running Featuretools with Spark and Dask" section with this?

Right now that section says "If you have a big data problem and are interested in testing our latest Dask or Spark integrations for free, please let us know by completing this simple request form."

However, if they want to use Dask, they don't need to fill out the form, since Dask support is now released. They should still fill out the form for Spark, though.

Contributor Author

@kmax12 If we remove the reference to Dask in this section (and the corresponding section in Improving Computational Performance), we also need to update the linked request form, which still mentions Dask. Can you update that?

Contributor

Yep, I'll do that right after we release! Thanks for the reminder.

* improve Dask docs

* combine parallel computation and performance guides

* more doc updates

* fix note text
setup.cfg (outdated)
@@ -1,7 +1,7 @@
[metadata]
description-file = README.md
[tool:pytest]
- addopts = --doctest-modules --ignore=featuretools/tests/plugin_tests/featuretools_plugin
+ addopts = --doctest-modules --ignore=featuretools/tests/plugin_tests/featuretools_plugin --ignore=featuretools/dask-tests-tmp/
@rwedge (Contributor) Jun 4, 2020

this ignore flag is no longer needed (the dask-tests-tmp one)

@rwedge (Contributor) left a comment

:shipit:

@kmax12 (Contributor) left a comment

Amazing work. Excited to get this merged and released. Nice job!

@thehomebrewnerd merged commit 996c189 into master Jun 4, 2020
@rwedge mentioned this pull request Jun 5, 2020
@rwedge deleted the dask-entity branch February 19, 2021 22:47

5 participants